Can Performant LLMs Be Ethical?

Developing Compliant LLM Training Data Respecting Data Owners' Opt-Out
📄 Paper 🐙 GitHub 🤗 Hugging Face

🎯 The Challenge

As more content owners opt out of web crawling for AI training, a critical question emerges: Can we build high-performing language models while respecting data-usage restrictions?

๐Ÿ› ๏ธ Contribution

We provide open-source tools and datasets to help the AI community build compliant training corpora:

  • A tool to check URL compliance and filter training data according to robots.txt restrictions
  • Per-document compliance status for popular datasets (FineWeb, FineWeb2, FineWeb-Edu)
  • Lists of 487K+ English and 333K+ multilingual domains that restrict AI crawlers
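As an illustration of the filtering step, the sketch below keeps only documents whose host (or any parent domain) is absent from a blocked-domain list. The one-domain-per-line file format and the `url` record field are assumptions for this example, not the released schema:

```python
from urllib.parse import urlparse

def load_blocked_domains(path):
    """Load one domain per line into a set for O(1) lookups."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def is_compliant(url, blocked):
    """A URL is compliant if neither its host nor any parent domain is blocked."""
    host = urlparse(url).netloc.lower().split(":")[0]
    parts = host.split(".")
    # e.g. news.example.com also matches example.com
    candidates = {".".join(parts[i:]) for i in range(len(parts) - 1)}
    candidates.add(host)
    return blocked.isdisjoint(candidates)

def filter_compliant(docs, blocked):
    """Keep only documents whose 'url' field points to a non-blocked domain."""
    return [d for d in docs if is_compliant(d["url"], blocked)]
```

A subdomain check is included because opt-outs are typically expressed at the registered-domain level, while crawled URLs often live on subdomains.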

📊 Our Research

We introduce the Data Compliance Gap (DCG), a metric that quantifies the performance difference between models trained on all available web data and models trained only on data whose owners have not opted out.

๐Ÿ” Key Findings

General Knowledge. With a DCG close to 0%, LLMs can achieve comparable performance using only openly available data.

Specialized Domains. A noticeable DCG remains in certain specialized areas.

Available Resources

🚫 Blocked Domain Lists

Comprehensive lists of restricted domains

  • 🇬🇧 487K+ English domains
  • 🌍 333K+ multilingual domains
Access Lists →

๐Ÿง Robots.txt Compliance Checker

Check whether any URL of interest is compliant

  • ๐Ÿ—“๏ธ Retrospective compliance
Install the checker โ†’
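Independent of the packaged checker linked above, the core idea can be sketched with Python's standard-library robots.txt parser. The user-agent names below are real, commonly blocked AI crawlers but only an example subset, and the helper names are ours:

```python
from urllib.robotparser import RobotFileParser

# Example AI-training crawler user agents; real block lists are longer.
AI_AGENTS = ["GPTBot", "CCBot", "Google-Extended", "anthropic-ai"]

def parse_robots(robots_txt):
    """Parse robots.txt text into a RobotFileParser (no network access)."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

def allowed_agents(rp, url, agents=AI_AGENTS):
    """Return {agent: True/False}: may each crawler fetch this URL?"""
    return {agent: rp.can_fetch(agent, url) for agent in agents}
```

In practice you would fetch `https://<domain>/robots.txt` (or, for retrospective checks, a historical snapshot of it) and pass its text to `parse_robots`.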

๐Ÿ—‚๏ธ Compliance Tagging

Instance-wise compliance tags for the FineWeb family of datasets

  • ๐Ÿท FineWeb
  • ๐Ÿฅ‚ FineWeb2
  • ๐Ÿ“š FineWeb-Edu
Browse Tags โ†’
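As a minimal sketch of how such instance-wise tags can be applied, the snippet below keeps only rows whose id is tagged compliant. The `id` field and the id-to-boolean tag mapping are hypothetical; see the released tags for the actual schema:

```python
def apply_compliance_tags(docs, tags):
    """Keep dataset rows whose id is tagged as compliant.

    docs: iterable of dicts with an 'id' field (hypothetical schema).
    tags: dict mapping document id -> bool (True = compliant).
    Rows with no tag are dropped conservatively.
    """
    return [d for d in docs if tags.get(d["id"], False)]
```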
Get Started with the Tool →