OpenThoughts: Data Recipes for Reasoning Models
Etash Kumar Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Rea Sprague, Ashima Suvarna, Benjamin Feuer, Leon Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik sharma, Charlie Cheng-Jie Ji, Yichuan Deng, Sarah M Pratt, Vivek Ramanujan, Jon Saad-Falcon, Stutee Acharya, Jeffrey Li, Achal Dave, Alon Albalak, Kushal Arora, Blake Wulfe, Chinmay Hegde, Greg Durrett, Sewoong Oh, Mohit Bansal, Saadia Gabriel, Aditya Grover, Kai-Wei Chang, Vaishaal Shankar, Aaron Gokaslan, Mike A Merrill, Tatsunori Hashimoto, Yejin Choi, Jenia Jitsev, Reinhard Heckel, Maheswaran Sathiamoorthy, Alex Dimakis, Ludwig Schmidt
Data pipeline analysis for training reasoning models
Abstract
Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning, since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. Our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond – improvements of 15.3, 17.2, and 20.5 percentage points compared to DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available on openthoughts.ai.
OpenThoughts releases open-source datasets and models for training reasoning models, achieving state-of-the-art results on AIME and code benchmarks.
- Creates open-source datasets for training reasoning models, addressing scarcity of public reasoning data
- OpenThoughts2-1M dataset produces OpenThinker2-32B matching DeepSeek-R1-Distill-32B performance
- Systematically investigates data generation pipeline with 1000+ experiments leading to OpenThoughts3-1.2M
- OpenThinker3-7B achieves state-of-the-art: 53% AIME 2025, 51% LiveCodeBench, 54% GPQA Diamond
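The pipeline summarized above boils down to teacher distillation for supervised fine-tuning: collect source questions, sample a chain-of-thought reasoning trace from a strong teacher (QwQ-32B in the paper), filter the results, and store prompt/completion pairs as SFT data. A minimal sketch of that flow, with a stub standing in for the teacher call (in practice the teacher would be queried through an inference server; the function and filter threshold here are illustrative assumptions, not the paper's actual implementation):

```python
# Sketch of a distillation-style SFT data pipeline: questions -> teacher
# chain-of-thought traces -> quality filter -> prompt/completion pairs.

def teacher_answer(question: str) -> str:
    """Hypothetical stand-in for sampling a reasoning trace from the teacher
    model (QwQ-32B in the paper). A real pipeline would call an inference API."""
    return f"<think>Working through: {question}</think> Answer to: {question}"

def build_sft_dataset(questions, min_len=20):
    """Annotate each question with a teacher trace and keep those passing
    a simple length-based quality filter (illustrative, not the paper's)."""
    dataset = []
    for q in questions:
        trace = teacher_answer(q)
        if len(trace) < min_len:  # drop traces too short to contain reasoning
            continue
        dataset.append({"prompt": q, "completion": trace})
    return dataset

examples = build_sft_dataset(["What is 2+2?", "Prove sqrt(2) is irrational."])
print(len(examples))  # prints 2
```

The paper's contribution is precisely in how each stage of such a pipeline (question sourcing, teacher choice, filtering) is selected, via the 1,000+ controlled experiments mentioned above.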
- Supervised fine-tuning
- Data generation
- Chain-of-thought reasoning
- OpenThoughts2-1M
- OpenThoughts3-1.2M
- AIME 2025
- LiveCodeBench
- GPQA Diamond
Did not explore datasets for reinforcement learning, a standard training regime for reasoning models (from the paper)
Did not explore staged SFT or curriculum learning to further improve performance (from the paper)
Authors did not state explicit future directions.
Author keywords
- Reasoning
- Data
- LLM
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.