ICLR 2026 Orals

Energy-Based Transformers are Scalable Learners and Thinkers

Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Peixuan Han, Hyeonjeong Ha, Aman Chadha, Yilun Du, Heng Ji, Jundong Li, Tariq Iqbal

LLMs & Reasoning · Fri, Apr 24 · 4:15 PM–4:25 PM · Amphitheater · Avg rating: 6.00 (2–8)
Author-provided TL;DR

We introduce Energy-Based Transformers, a scalable new approach for learning how to think from unsupervised learning, generalizing current System 2 Thinking/reasoning approaches.

Abstract

Inference-time computation, analogous to human System 2 Thinking, has recently become popular for improving model performance. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question “Is it possible to generalize these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?” We find the answer is yes, by learning to explicitly verify the compatibility between inputs and candidate-predictions, and then re-framing prediction problems as optimization with respect to this verifier. Specifically, we train Energy-Based Transformers (EBTs), a new class of Energy-Based Models (EBMs), to assign an energy value to every input and candidate-prediction, enabling predictions through energy minimization until convergence. To support this approach, we introduce several key techniques for stable and parallelizable training, which enable the emergence of strong System 2 Thinking capabilities and scalable EBMs. Across discrete and continuous modalities, we find EBTs outperform the Transformer++ approach, scaling up to 35% faster during pretraining, and improving inference-time performance by up to 29%. EBTs also surpass Diffusion Transformers on image denoising while requiring 99% fewer forward passes. Moreover, System 2 Thinking with EBTs yields larger performance gains on data that is farther out-of-distribution, and EBTs achieve better results than existing models on most downstream tasks despite achieving the same or worse pretraining performance, suggesting that EBTs generalize better than existing approaches. Consequently, EBTs are a flexible and exciting new approach for scaling both the learning and thinking capabilities of models.
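
A minimal sketch of the prediction-as-energy-minimization idea described in the abstract, assuming a toy MLP energy function in place of the paper's Transformer-based energy model; ToyEnergyModel and predict_by_energy_minimization are hypothetical names for illustration, not the authors' implementation. A candidate prediction starts from noise and is refined by gradient descent on the energy with respect to the candidate, so additional optimization steps translate directly into additional inference-time compute ("thinking").

import torch
import torch.nn as nn

class ToyEnergyModel(nn.Module):
    # Maps a (context, candidate-prediction) pair to a scalar energy;
    # lower energy means the candidate is more compatible with the context.
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, context, candidate):
        return self.net(torch.cat([context, candidate], dim=-1)).squeeze(-1)

def predict_by_energy_minimization(model, context, steps=10, lr=0.1):
    # "Thinking" as optimization: refine a randomly initialized candidate by
    # following the energy gradient w.r.t. the candidate until (approximate) convergence.
    candidate = torch.randn_like(context, requires_grad=True)
    for _ in range(steps):  # more steps = more inference-time compute
        energy = model(context, candidate).sum()
        grad, = torch.autograd.grad(energy, candidate)
        candidate = (candidate - lr * grad).detach().requires_grad_(True)
    return candidate.detach()

# Usage: refine predictions for a batch of 4 contexts of width 16.
model = ToyEnergyModel(dim=16)
prediction = predict_by_energy_minimization(model, torch.randn(4, 16))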

One-sentence summary·Auto-generated by claude-haiku-4-5-20251001

Energy-Based Transformers learn to verify the compatibility between inputs and candidate predictions and then predict by minimizing energy, letting System 2 Thinking emerge from unsupervised learning alone across discrete and continuous modalities.

Contributions·Auto-generated by claude-haiku-4-5-20251001
  • Energy-Based Transformers (EBTs), a new class of Energy-Based Models that assign an energy to each input and candidate prediction and predict via energy minimization until convergence
  • Training techniques for stable and parallelizable EBM learning, enabling System 2 Thinking to emerge from unsupervised pretraining
  • Results across discrete and continuous modalities: pretraining scaling up to 35% faster than Transformer++, inference-time performance gains of up to 29%, and image denoising surpassing Diffusion Transformers with 99% fewer forward passes
Methods used·Auto-generated by claude-haiku-4-5-20251001
  • Energy-based modeling (verification of input–prediction compatibility)
  • Iterative energy minimization at inference time
  • Unsupervised pretraining
  • Transformer architectures

Author keywords

  • Energy-Based Models
  • System 2 Thinking
  • Reasoning
  • Verification
  • Scaling
  • Transformers
  • Generative Modeling
