Can Performant LLMs Be Ethical?
Developing Compliant LLM Training Data Respecting Data Owners' Opt-Out
The Challenge
As more content owners opt out of web crawling for AI training, a critical question emerges: can we build high-performing language models while respecting data-usage restrictions?
Contribution
We provide open-source tools and datasets to help the AI community:
- Check URL compliance and filter training data to respect robots.txt restrictions
- Document-wise compliance status for popular datasets (FineWeb, FineWeb2, FineWeb-Edu)
- 487K+ English and 333K+ multilingual domains that restrict AI crawlers
Our Research
We introduce the Data Compliance Gap (DCG), a metric that quantifies the performance difference between:
- Compliant models, trained only on data that respects robots.txt opt-outs.
- Non-compliant models, trained on all available web data.
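Read literally, one natural formalization of this gap (our reading of the description above, not necessarily the paper's exact definition) is the relative score drop incurred by restricting training to compliant data:

```latex
\mathrm{DCG} = \frac{S_{\text{non-compliant}} - S_{\text{compliant}}}{S_{\text{non-compliant}}} \times 100\%
```

where S denotes a benchmark score of the corresponding model; a DCG near 0% then means that excluding opted-out data costs essentially no performance.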
Key Findings
General Knowledge. The DCG is close to 0%: LLMs can achieve comparable performance using only openly available data.
Specialized Domains. A noticeable DCG appears in areas such as:
- Biomedical research
- Structural knowledge formats
- Robustness against adversarial examples
Available Resources
Blocked Domain Lists
Comprehensive lists of restricted domains
- 487K+ English domains
- 333K+ multilingual domains
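A minimal sketch of how such a list could be used to filter a corpus; the file name blocked_domains_en.txt and the document format are placeholders, not the released artifact names:

```python
from urllib.parse import urlparse

# Load a newline-delimited list of restricted domains
# (the file name is a placeholder for the released domain list).
with open("blocked_domains_en.txt", encoding="utf-8") as f:
    blocked = {line.strip().lower() for line in f if line.strip()}

def is_blocked(url: str) -> bool:
    """True if the URL's host, or any parent domain of it, is in the blocked set."""
    host = urlparse(url).netloc.lower().split(":")[0]
    parts = host.split(".")
    # Match the host itself or any parent domain (e.g. news.example.com -> example.com).
    return any(".".join(parts[i:]) in blocked for i in range(max(len(parts) - 1, 1)))

# Example: keep only documents whose source domain has not opted out.
docs = [{"url": "https://example.com/article", "text": "..."}]
compliant_docs = [d for d in docs if not is_blocked(d["url"])]
```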
Robots.txt Compliance Checker
Check whether any URL of interest is compliant
- Retrospective compliance
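For intuition, a live (non-retrospective) robots.txt check can be done with Python's standard library; the AI crawler user agents below are common examples, and this sketch is not the released checker's API:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# User agents used by well-known AI-training crawlers (illustrative, not exhaustive).
AI_AGENTS = ["GPTBot", "CCBot", "Google-Extended", "anthropic-ai"]

def allowed_for_ai_training(url: str) -> bool:
    """Check the site's *current* robots.txt; retrospective checks need archived snapshots."""
    origin = "{0.scheme}://{0.netloc}".format(urlparse(url))
    robots = RobotFileParser()
    robots.set_url(origin + "/robots.txt")
    robots.read()  # fetches and parses the live robots.txt
    # Treat the URL as compliant only if every listed AI agent may fetch it.
    return all(robots.can_fetch(agent, url) for agent in AI_AGENTS)

if __name__ == "__main__":
    print(allowed_for_ai_training("https://example.com/some/page"))
```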
Compliance Tagging
Instance-wise compliance tags for the FineWeb family of datasets
- FineWeb
- FineWeb2
- FineWeb-Edu
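A sketch of how such per-document tags could be consumed with the Hugging Face datasets library; the compliance column name is_compliant is hypothetical, and in practice the released tags may need to be joined on a document id or URL rather than read directly from FineWeb:

```python
from datasets import load_dataset

# Stream a FineWeb sample ("sample-10BT" is an existing FineWeb subset).
fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                       split="train", streaming=True)

def keep_compliant(example):
    # Hypothetical field name; substitute the actual released compliance tag,
    # joined on the document id/URL if it ships as a separate table.
    return bool(example.get("is_compliant", False))

# Inspect a bounded slice of the stream and keep only compliant documents.
kept = [ex for ex in fineweb.take(1000) if keep_compliant(ex)]
print(f"{len(kept)} of 1000 streamed documents are tagged compliant")
```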