ICLR 2026 Orals

Energy-Based Transformers are Scalable Learners and Thinkers

Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Peixuan Han, Hyeonjeong Ha, Aman Chadha, Yilun Du, Heng Ji, Jundong Li, Tariq Iqbal

LLMs & Reasoning · Fri, Apr 24 · 4:15 PM–4:25 PM · Amphitheater · Avg rating: 6.00 (2–8)
Author-provided TL;DR

We introduce Energy-Based Transformers, a scalable new approach for learning how to think from unsupervised learning, generalizing current System 2 Thinking/reasoning approaches.

Abstract

Inference-time computation, analogous to human System 2 Thinking, has recently become popular for improving model performance. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question “Is it possible to generalize these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?” We find the answer is yes, by learning to explicitly verify the compatibility between inputs and candidate-predictions, and then re-framing prediction problems as optimization with respect to this verifier. Specifically, we train Energy-Based Transformers (EBTs), a new class of Energy-Based Models (EBMs), to assign an energy value to every input and candidate-prediction, enabling predictions through energy minimization until convergence. To support this approach, we introduce several key techniques for stable and parallelizable training, which enable the emergence of strong System 2 Thinking capabilities and scalable EBMs. Across discrete and continuous modalities, we find EBTs outperform the Transformer++ approach, scaling up to 35% faster during pretraining, and improving inference-time performance by up to 29%. EBTs also surpass Diffusion Transformers on image denoising while requiring 99% fewer forward passes. Moreover, System 2 Thinking with EBTs yields larger performance gains on data that is farther out-of-distribution, and EBTs achieve better results than existing models on most downstream tasks despite achieving the same or worse pretraining performance, suggesting that EBTs generalize better than existing approaches. Consequently, EBTs are a flexible and exciting new approach for scaling both the learning and thinking capabilities of models.
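
A minimal sketch of the prediction-as-energy-minimization idea described in the abstract, assuming a toy MLP energy function in place of the paper's Transformer-based energy model; ToyEnergyModel and predict_by_energy_minimization are hypothetical names for illustration, not the authors' implementation. A candidate prediction starts from noise and is refined by gradient descent on the energy with respect to the candidate, so additional optimization steps translate directly into additional inference-time compute ("thinking").

import torch
import torch.nn as nn

class ToyEnergyModel(nn.Module):
    # Maps a (context, candidate-prediction) pair to a scalar energy;
    # lower energy means the candidate is more compatible with the context.
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, context, candidate):
        return self.net(torch.cat([context, candidate], dim=-1)).squeeze(-1)

def predict_by_energy_minimization(model, context, steps=10, lr=0.1):
    # "Thinking" as optimization: refine a randomly initialized candidate by
    # following the energy gradient w.r.t. the candidate until (approximate) convergence.
    candidate = torch.randn_like(context, requires_grad=True)
    for _ in range(steps):  # more steps = more inference-time compute
        energy = model(context, candidate).sum()
        grad, = torch.autograd.grad(energy, candidate)
        candidate = (candidate - lr * grad).detach().requires_grad_(True)
    return candidate.detach()

# Usage: refine predictions for a batch of 4 contexts of width 16.
model = ToyEnergyModel(dim=16)
prediction = predict_by_energy_minimization(model, torch.randn(4, 16))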

One-sentence summary·Auto-generated by claude-haiku-4-5-20251001

Energy-Based Transformers learn to verify the compatibility between inputs and candidate predictions and then predict by minimizing energy, letting System 2 Thinking emerge from unsupervised learning alone across discrete and continuous modalities.

Contributions·Auto-generated by claude-haiku-4-5-20251001
  • Energy-Based Transformers (EBTs), a new class of Energy-Based Models that assign an energy to each input and candidate prediction and predict via energy minimization until convergence
  • Training techniques for stable and parallelizable EBM learning, enabling System 2 Thinking to emerge from unsupervised pretraining
  • Results across discrete and continuous modalities: pretraining scaling up to 35% faster than Transformer++, inference-time performance gains of up to 29%, and image denoising surpassing Diffusion Transformers with 99% fewer forward passes
Methods used·Auto-generated by claude-haiku-4-5-20251001
  • Energy-based modeling (verification of input–prediction compatibility)
  • Iterative energy minimization at inference time
  • Unsupervised pretraining
  • Transformer architectures

Author keywords

  • Energy-Based Models
  • System 2 Thinking
  • Reasoning
  • Verification
  • Scaling
  • Transformers
  • Generative Modeling
