We provide a comprehensive tool to help the AI community filter training data in compliance with robots.txt restrictions. It is designed to be easy to use while respecting content creators’ wishes as expressed in their robots.txt files.
Why Use This Tool?
- Ethical AI Development: Build models that respect content creators’ rights
- Legal Compliance: Avoid potential copyright issues in your training data
- Transparency: Know exactly what data you’re using
Retrospective Compliance Filtering
We first rank domains by the amount of data they contribute to the corpus and select the top 1 million English domains and the top 1 million non-English domains. Some of these domains are now offline, and the two lists partially overlap, so the number of robots.txt files we could successfully download is far fewer than 2 million.
For every domain that is still reachable, we retrieve its robots.txt file as of January 2025 and evaluate the rules relevant to AI training. Specifically, we look at the directives that apply to the AI-specific user agents listed below.
"AI2Bot", # AI2
"Applebot-Extended", # Apple
"Bytespider", # Bytedance
"CCBot", # Common Crawl
"CCBot/2.0", # Common Crawl
"CCBot/1.0", # Common Crawl
"ClaudeBot", # Anthropic
"cohere-training-data-crawler", # Cohere
"Diffbot", # Diffbot
"Meta-ExternalAgent", # Meta
"Google-Extended", # Google
"GPTBot", # OpenAI
"PanguBot", # Huawei
"*"
🚫 Pre-filtered Domain Lists
We provide curated lists of domains whose robots.txt files restrict AI crawlers. A domain is included in the list if any of its subdomains disallows crawling by any of the AI training user agents above.
🇬🇧 English Corpus
Download English Blocked Domains 🤗 - 487K+ blocked domains
🌍 Multilingual Corpus
Download Multilingual Blocked Domains 🤗 - 333K+ blocked domains
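Once downloaded, the lists can be used to drop documents from blocked domains. The sketch below assumes a plain-text file with one domain per line (the file name is illustrative; check the dataset card for the actual format) and uses the third-party tldextract package to map a URL to its registered domain:

import tldextract  # third-party: pip install tldextract

# Assumption: one blocked domain per line; file name is illustrative.
with open("blocked_domains_english.txt") as f:
    blocked = {line.strip() for line in f if line.strip()}

def is_blocked(url: str) -> bool:
    """True if the URL's registered domain appears in the blocked-domain list."""
    return tldextract.extract(url).registered_domain in blocked

print(is_blocked("https://news.example.com/article/123"))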
🔍 URL Compliance Checker
A single domain can host many subdomains, and each one may follow a different robots.txt policy. Our robots-checker package lets you zoom in on the exact URL and instantly see whether it is compliant.
Installation
pip install robots-checker==1.2.3
Usage
import url_checker
checker = url_checker.RobotsTxtComplianceChecker()
status = checker.is_compliant("https://blog.example.com/some-page")
print(status) # ➜ "Compliant" or "NonCompliant"
Additionally, our GitHub repository provides fine-grained data filtering code, built on Datatrove, that uses this checker.
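As a rough sketch of how such a filter could look (this is not the code from our repository), one could wrap the checker in a custom Datatrove filter, assuming documents expose their source URL under metadata["url"]:

import url_checker
from datatrove.data import Document
from datatrove.pipeline.filters.base_filter import BaseFilter

class RobotsComplianceFilter(BaseFilter):
    """Keep only documents whose source URL is compliant according to robots-checker."""
    name = "robots_compliance_filter"

    def __init__(self):
        super().__init__()
        self.checker = url_checker.RobotsTxtComplianceChecker()

    def filter(self, doc: Document) -> bool:
        # Assumption: the document's source URL is stored under metadata["url"].
        url = doc.metadata.get("url", "")
        return self.checker.is_compliant(url) == "Compliant"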
🗂️ Document-wise Compliance Tag
Due to data distribution restrictions, we are unable to upload the filtered dataset directly. However, we provide a document-wise compliance tag, computed with the robots-checker, for easy filtering. For each document in the FineWeb family, the tag is either true or false; true means the document comes from a compliant source.
📚 FineWeb-Edu Compliant Tag
swiss-ai/fineweb-edu-compliant-tag 🤗
- Base Dataset: HuggingFaceFW/fineweb-edu 🤗
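A minimal sketch of how the tags might be applied is shown below. The column names "id" and "compliant" are assumptions; consult the dataset card for the actual schema.

from datasets import load_dataset

# Assumed column names ("id", "compliant"); see the dataset card for the actual schema.
tags = load_dataset("swiss-ai/fineweb-edu-compliant-tag", split="train", streaming=True)
compliant_ids = {row["id"] for row in tags if row["compliant"]}

# Stream the base corpus and keep only documents tagged as compliant.
# Note: holding every id in memory is a simplification; at corpus scale, a per-shard join is more practical.
fineweb = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
compliant_docs = (doc for doc in fineweb if doc["id"] in compliant_ids)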
📄 Citation
@inproceedings{fan2025datacompliance,
title={Can Performant LLMs Be Ethical? Quantifying the Impact of Web Crawling Opt-Outs},
author={Fan, Dongyang and Sabolčec, Vinko and Ansaripour, Matin and
Tarun, Ayush Kumar and Jaggi, Martin and Bosselut, Antoine and Schlag, Imanol},
booktitle={Conference on Language Modeling (COLM)},
year={2025}
}