Can Performant LLMs Be Ethical?
Developing Compliant LLM Training Data Respecting Data Owners' Opt-Out
The Challenge
As more content owners opt out of web crawling for AI training, a critical question emerges: can we build high-performing language models while respecting data-usage restrictions?
Contribution
We provide open-source tools and datasets to help the AI community:
- Check URL compliance and filter training data to respect robots.txt restrictions
- Document-wise compliance status for popular datasets (FineWeb, FineWeb2, FineWeb-Edu)
- 487K+ English and 333K+ multilingual domains that restrict AI crawlers
Our Research
We introduce the Data Compliance Gap (DCG), a metric that quantifies the performance difference between:
- Compliant models, trained only on data that respects robots.txt opt-outs.
- Non-compliant models, trained on all available web data.
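Read literally, one natural formalization of this gap (our reading of the description above, not necessarily the paper's exact definition) is the relative score drop incurred by restricting training to compliant data:

```latex
\mathrm{DCG} = \frac{S_{\text{non-compliant}} - S_{\text{compliant}}}{S_{\text{non-compliant}}} \times 100\%
```

where S denotes a benchmark score of the corresponding model; a DCG near 0% then means that excluding opted-out data costs essentially no performance.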
Key Findings
General Knowledge. The DCG is close to 0%: LLMs can achieve comparable performance using only openly available data.
Specialized Domains. A noticeable DCG appears in areas such as:
- Biomedical research
- Structural knowledge formats
- Robustness against adversarial examples
Available Resources
Blocked Domain Lists
Comprehensive lists of restricted domains
- 487K+ English domains
- 333K+ multilingual domains
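A minimal sketch of how such a list could be used to filter a corpus; the file name blocked_domains_en.txt and the document format are placeholders, not the released artifact names:

```python
from urllib.parse import urlparse

# Load a newline-delimited list of restricted domains
# (the file name is a placeholder for the released domain list).
with open("blocked_domains_en.txt", encoding="utf-8") as f:
    blocked = {line.strip().lower() for line in f if line.strip()}

def is_blocked(url: str) -> bool:
    """True if the URL's host, or any parent domain of it, is in the blocked set."""
    host = urlparse(url).netloc.lower().split(":")[0]
    parts = host.split(".")
    # Match the host itself or any parent domain (e.g. news.example.com -> example.com).
    return any(".".join(parts[i:]) in blocked for i in range(max(len(parts) - 1, 1)))

# Example: keep only documents whose source domain has not opted out.
docs = [{"url": "https://example.com/article", "text": "..."}]
compliant_docs = [d for d in docs if not is_blocked(d["url"])]
```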
Robots.txt Compliance Checker
Check whether any URL of interest is compliant
- Retrospective compliance
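For intuition, a live (non-retrospective) robots.txt check can be done with Python's standard library; the AI crawler user agents below are common examples, and this sketch is not the released checker's API:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# User agents used by well-known AI-training crawlers (illustrative, not exhaustive).
AI_AGENTS = ["GPTBot", "CCBot", "Google-Extended", "anthropic-ai"]

def allowed_for_ai_training(url: str) -> bool:
    """Check the site's *current* robots.txt; retrospective checks need archived snapshots."""
    origin = "{0.scheme}://{0.netloc}".format(urlparse(url))
    robots = RobotFileParser()
    robots.set_url(origin + "/robots.txt")
    robots.read()  # fetches and parses the live robots.txt
    # Treat the URL as compliant only if every listed AI agent may fetch it.
    return all(robots.can_fetch(agent, url) for agent in AI_AGENTS)

if __name__ == "__main__":
    print(allowed_for_ai_training("https://example.com/some/page"))
```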
Compliance Tagging
Instance-wise compliance tags for the FineWeb family of datasets
- FineWeb
- FineWeb2
- FineWeb-Edu
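A sketch of how such per-document tags could be consumed with the Hugging Face datasets library; the compliance column name is_compliant is hypothetical, and in practice the released tags may need to be joined on a document id or URL rather than read directly from FineWeb:

```python
from datasets import load_dataset

# Stream a FineWeb sample ("sample-10BT" is an existing FineWeb subset).
fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                       split="train", streaming=True)

def keep_compliant(example):
    # Hypothetical field name; substitute the actual released compliance tag,
    # joined on the document id/URL if it ships as a separate table.
    return bool(example.get("is_compliant", False))

# Inspect a bounded slice of the stream and keep only compliant documents.
kept = [ex for ex in fineweb.take(1000) if keep_compliant(ex)]
print(f"{len(kept)} of 1000 streamed documents are tagged compliant")
```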