ICLR 2026 Orals

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Pierre-Carl Langlais, Pavel Chizhov, Catherine Arnett, Carlos Rosas Hinostroza, Mattia Nee, Eliot Krzysztof Jones, Irène Girard, David Mach, Anastasia Stasenko, Ivan P. Yamshchikov

LLMs & Reasoning · Fri, Apr 24 · 3:51 PM–4:01 PM · 204 A/B · Avg rating: 7.00 (range 6–8)
Author-provided TL;DR

We assemble and release the largest truly open multilingual dataset for LLM pre-training, consisting of 2 trillion tokens.

Abstract

Large Language Models (LLMs) are pre-trained on large volumes of data from different sources and domains. These datasets often contain trillions of tokens, including large portions of copyrighted or proprietary content, which raises questions about the legality of using such models. This underscores the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for LLM pre-training. The data assembled in Common Corpus are either uncopyrighted or under permissive licenses and amount to about two trillion tokens. The dataset covers a wide variety of languages, ranging from high-resource European languages to low-resource languages rarely represented in pre-training datasets, and also includes a large amount of code. The diversity of data sources, in terms of both covered domains and time periods, opens up paths for research and entrepreneurial applications across diverse areas of knowledge. We present the detailed provenance of the assembled data along with the details of dataset filtering and curation. We train two small language models on Common Corpus and find that they perform comparably to other models of their size, indicating that our dataset is suitable for multilingual pre-training. Common Corpus represents a key contribution to the ecosystem for open science research on Large Language Models.
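
A minimal sketch of how one might stream the released corpus with the Hugging Face datasets library. The hub ID "PleIAs/common_corpus" and the "text" field per record are assumptions made for illustration, not details confirmed on this page.

    # Minimal sketch: stream Common Corpus rather than downloading ~2T
    # tokens up front. The hub ID "PleIAs/common_corpus" and the "text"
    # field are assumptions for illustration.
    from datasets import load_dataset

    corpus = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

    for i, record in enumerate(corpus):
        print(record["text"][:200])  # peek at the first few documents
        if i >= 2:
            break

Streaming mode iterates over shards on demand, which is the practical way to sample or inspect a multi-trillion-token corpus on a single machine.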

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Common Corpus is an approximately two-trillion-token, openly licensed, multilingual dataset for LLM pre-training, validated by small language models that perform comparably to others of their size.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Common Corpus, the largest open dataset for LLM pre-training: about two trillion tokens that are either uncopyrighted or permissively licensed
  • Broad language coverage, from high-resource European languages to low-resource languages rarely represented in pre-training data, plus a large amount of code
  • Detailed documentation of data provenance, filtering, and curation
  • Two small language models trained on the corpus, performing comparably to other models of their size
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • License-based data sourcing (see the sketch after this list)
  • Dataset filtering and curation
  • Multilingual pre-training
  • Small-scale language model training and evaluation
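
As a rough illustration of the license-based sourcing listed above, the sketch below keeps only records whose license tag appears in a permissive allowlist. The "license" field name and the allowlist entries are hypothetical, not taken from the paper.

    # Hypothetical license allowlist filter; the "license" field and the
    # identifiers below are assumptions, not details from the paper.
    PERMISSIVE = {"public-domain", "cc0-1.0", "cc-by-4.0", "mit", "apache-2.0"}

    def is_permissive(record: dict) -> bool:
        """Keep a record only if its license tag is in the allowlist."""
        return record.get("license", "").lower() in PERMISSIVE

    sample = [
        {"text": "An 1890 novel ...", "license": "public-domain"},
        {"text": "A paywalled article ...", "license": "proprietary"},
    ]
    print([r["license"] for r in sample if is_permissive(r)])  # ['public-domain']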

Author keywords

  • dataset
  • pre-training
  • large language models
  • open data
  • open science
  • multilingual
