🇨🇭 Swiss AI Initiative

This project was developed as part of the Swiss AI Initiative to create compliant training data for Swiss AI models. Our goal is to ensure that Swiss AI development adheres to the highest ethical standards by respecting content creators' rights and web crawling restrictions.

The Swiss AI Initiative is the largest open-science, open-source effort for AI foundation models worldwide. It leverages Alps, the world's most AI-capable supercomputer, operated by the Swiss National Supercomputing Centre (CSCS) and equipped with more than 10,000 NVIDIA Grace Hopper superchips.

Supported by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID a06 on Alps, we provide the tools and datasets necessary for training performant yet ethical language models in compliance with Swiss values of privacy, transparency, and respect for intellectual property.

The Research Behind the Tool

This project accompanies the research paper “Can Performant LLMs Be Ethical? Quantifying the Impact of Web Crawling Opt-Outs”, presented at COLM 2025, and was developed specifically to support the Swiss AI data pipeline.

📚 The Context

The success of Large Language Models (LLMs) heavily depends on web-scale data. However, as data becomes an increasingly valuable business asset, more content owners are restricting access through robots.txt files. This raises a critical question: can LLMs trained only on data that respects these opt-outs remain performant?

📑 What Does “Compliant” Mean?

We consider a dataset compliant if it satisfies two conditions: it was fetched according to the robots.txt rules in place at the time of crawling (Common Crawl already follows this), and the site still allows AI training today, which is the main focus of this work. To enforce the second condition, we run a retrospective check that removes any domain whose current robots.txt blocks any of the AI-specific crawlers.
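
As a minimal sketch of such a retrospective check, Python's standard `urllib.robotparser` can evaluate a domain's current robots.txt against AI-specific user agents. The crawler list and function name below are illustrative, not the exact set or code used in our pipeline:

```python
from urllib.robotparser import RobotFileParser

# User agents of AI-specific crawlers. An illustrative list, not
# necessarily the exact set checked by the paper's pipeline.
AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended", "anthropic-ai"]

def domain_is_compliant(robots_txt: str, path: str = "/") -> bool:
    """Return True only if the given robots.txt allows every AI crawler
    to fetch `path`; a single blocked crawler disqualifies the domain."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return all(parser.can_fetch(agent, path) for agent in AI_CRAWLERS)

# Example: a site that opts out of GPTBot only.
robots = """
User-agent: GPTBot
Disallow: /
"""
print(domain_is_compliant(robots))  # False: GPTBot is blocked
```

In a real pipeline the robots.txt text would be fetched live per domain; here it is inlined to keep the sketch self-contained.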

🔬 Our Approach

We introduce the Data Compliance Gap (DCG), a metric that quantifies the performance difference between:

  1. Compliant models: Trained exclusively on data that respects robots.txt opt-outs
  2. Non-compliant models: Trained on all available web data
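
As a hedged illustration (the paper gives the precise definition), the DCG can be thought of as the relative performance drop from enforcing compliance:

```python
def data_compliance_gap(score_noncompliant: float, score_compliant: float) -> float:
    """Relative performance drop from training only on compliant data.
    NOTE: an illustrative formulation, not necessarily the exact
    definition used in the paper."""
    return (score_noncompliant - score_compliant) / score_noncompliant

# Hypothetical benchmark accuracies for the two model variants:
gap = data_compliance_gap(0.62, 0.61)
print(gap)  # a small relative gap, roughly 1.6%
```

A near-zero value means compliance costs essentially nothing on that benchmark; a large positive value flags a domain where opted-out data carried real signal.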

We measure the DCG across both general-knowledge benchmarks and specialized domains.

📌 Key Findings

Our experiments with 1.5B-parameter models reveal the following:

General Knowledge ✅

Near-zero DCG: models trained on fully open data perform comparably to those trained on all data. Even if all news publishers opt out, the impact remains minimal, since factual knowledge is often available from multiple sources and republications.

Specialized Domains ⚠️

Notable performance gaps appear in specialized domains whose benchmarks draw heavily on opted-out sources.

Memorization Trade-offs โœ๏ธ

Compliant training reduces verbatim memorization of copyrighted content without necessarily compromising specific knowledge derived from the publications.
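
One common proxy for verbatim memorization is the fraction of word n-grams in a model's output that appear verbatim in the source text. The sketch below is illustrative; the paper may use a different measure:

```python
def ngram_overlap(generated: str, reference: str, n: int = 8) -> float:
    """Fraction of word n-grams in `generated` that occur verbatim in
    `reference`. A rough proxy for verbatim memorization; illustrative
    only, not necessarily the metric used in the paper."""
    def ngrams(text: str) -> set:
        words = text.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    gen = ngrams(generated)
    if not gen:
        return 0.0  # text shorter than n words: nothing to match
    return len(gen & ngrams(reference)) / len(gen)

identical = "the quick brown fox jumps over the lazy dog today"
print(ngram_overlap(identical, identical))  # 1.0: fully memorized
```

Lower overlap on held-out copyrighted passages, with benchmark accuracy unchanged, is the kind of evidence behind the trade-off described above.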

💡 Implications

Our findings suggest that general-purpose LLMs can be trained ethically without sacrificing overall performance, while specialized applications may require careful consideration of data sources. Overall, we believe the AI community needs tools to navigate compliance requirements effectively.

🌟 Why This Matters

As the debate around AI and copyright continues, empirical evidence is crucial. Our research provides data on the trade-offs between compliance and performance, helping inform AI development practices and standards for ethical AI training.

โš–๏ธ License


📄 Read the full paper →

๐Ÿ› ๏ธ Get started with the tool โ†’

👥 Meet the team →