ICLR 2026 Orals

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Pierre-Carl Langlais, Pavel Chizhov, Catherine Arnett, Carlos Rosas Hinostroza, Mattia Nee, Eliot Krzysztof Jones, Irène Girard, David Mach, Anastasia Stasenko, Ivan P. Yamshchikov

LLMs & Reasoning · Fri, Apr 24 · 3:51 PM–4:01 PM · 204 A/B · Avg rating: 7.00 (range 6–8)
Author-provided TL;DR

We assemble and release the largest truly open multilingual dataset for LLM pre-training, consisting of 2 trillion tokens.

Abstract

Large Language Models (LLMs) are pre-trained on large volumes of data from different sources and domains. These datasets often contain trillions of tokens, including large portions of copyrighted or proprietary content, which raises questions about the legality of using such models. This underscores the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for LLM pre-training. The data assembled in Common Corpus are either uncopyrighted or under permissive licenses and amount to about two trillion tokens. The dataset covers a wide variety of languages, ranging from high-resource European languages to low-resource languages rarely represented in pre-training datasets, and also includes a large amount of code. The diversity of data sources, in terms of both covered domains and time periods, opens up paths for research and entrepreneurial applications across diverse areas of knowledge. We present the detailed provenance of the assembled data along with the details of dataset filtering and curation. We train two small language models on Common Corpus and find that they perform comparably to other models of their size, indicating that our dataset is suitable for multilingual pre-training. Common Corpus represents a key contribution to the ecosystem for open science research on Large Language Models.
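
A minimal sketch of how one might stream the released corpus with the Hugging Face datasets library. The hub ID "PleIAs/common_corpus" and the "text" field per record are assumptions made for illustration, not details confirmed on this page.

    # Minimal sketch: stream Common Corpus rather than downloading ~2T
    # tokens up front. The hub ID "PleIAs/common_corpus" and the "text"
    # field are assumptions for illustration.
    from datasets import load_dataset

    corpus = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

    for i, record in enumerate(corpus):
        print(record["text"][:200])  # peek at the first few documents
        if i >= 2:
            break

Streaming mode iterates over shards on demand, which is the practical way to sample or inspect a multi-trillion-token corpus on a single machine.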

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Common Corpus is an approximately two-trillion-token, openly licensed, multilingual dataset for LLM pre-training, validated by small language models that perform comparably to others of their size.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Common Corpus, the largest open dataset for LLM pre-training: about two trillion tokens that are either uncopyrighted or permissively licensed
  • Broad language coverage, from high-resource European languages to low-resource languages rarely represented in pre-training data, plus a large amount of code
  • Detailed documentation of data provenance, filtering, and curation
  • Two small language models trained on the corpus, performing comparably to other models of their size
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • License-based data sourcing (see the sketch after this list)
  • Dataset filtering and curation
  • Multilingual pre-training
  • Small-scale language model training and evaluation
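
As a rough illustration of the license-based sourcing listed above, the sketch below keeps only records whose license tag appears in a permissive allowlist. The "license" field name and the allowlist entries are hypothetical, not taken from the paper.

    # Hypothetical license allowlist filter; the "license" field and the
    # identifiers below are assumptions, not details from the paper.
    PERMISSIVE = {"public-domain", "cc0-1.0", "cc-by-4.0", "mit", "apache-2.0"}

    def is_permissive(record: dict) -> bool:
        """Keep a record only if its license tag is in the allowlist."""
        return record.get("license", "").lower() in PERMISSIVE

    sample = [
        {"text": "An 1890 novel ...", "license": "public-domain"},
        {"text": "A paywalled article ...", "license": "proprietary"},
    ]
    print([r["license"] for r in sample if is_permissive(r)])  # ['public-domain']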

Author keywords

  • dataset
  • pre-training
  • large language models
  • open data
  • open science
  • multilingual
