ICLR 2026 Orals

OpenThoughts: Data Recipes for Reasoning Models

Etash Kumar Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Rea Sprague, Ashima Suvarna, Benjamin Feuer, Leon Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, Yichuan Deng, Sarah M Pratt, Vivek Ramanujan, Jon Saad-Falcon, Stutee Acharya, Jeffrey Li, Achal Dave, Alon Albalak, Kushal Arora, Blake Wulfe, Chinmay Hegde, Greg Durrett, Sewoong Oh, Mohit Bansal, Saadia Gabriel, Aditya Grover, Kai-Wei Chang, Vaishaal Shankar, Aaron Gokaslan, Mike A Merrill, Tatsunori Hashimoto, Yejin Choi, Jenia Jitsev, Reinhard Heckel, Maheswaran Sathiamoorthy, Alex Dimakis, Ludwig Schmidt

LLMs & Reasoning · Fri, Apr 24 · 3:15 PM–3:25 PM · 204 A/B · Avg rating: 6.50 (6–8)
Author-provided TL;DR

Data pipeline analysis for training reasoning models

Abstract

Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning, since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. Our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond, improvements of 15.3, 17.2, and 20.5 percentage points over DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available on openthoughts.ai.
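The recipe the abstract describes (source questions, teacher-generated chain-of-thought traces, then supervised fine-tuning of a student) follows the standard distillation pattern. Below is a minimal, hypothetical sketch of the trace-generation step, assuming an OpenAI-compatible inference server (e.g. vLLM) hosting the QwQ-32B teacher; the file names, prompt format, and sampling parameters are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch of the teacher-distillation step described in the abstract, not the
# authors' actual pipeline. Assumes an OpenAI-compatible server (e.g. vLLM) hosting
# QwQ-32B; questions.jsonl / sft_data.jsonl and the sampling settings are hypothetical.
import json

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def generate_trace(question: str) -> str:
    """Ask the teacher model for a long chain-of-thought solution."""
    resp = client.chat.completions.create(
        model="Qwen/QwQ-32B",
        messages=[{"role": "user", "content": question}],
        temperature=0.7,
        max_tokens=16384,
    )
    return resp.choices[0].message.content

with open("questions.jsonl") as fin, open("sft_data.jsonl", "w") as fout:
    for line in fin:
        question = json.loads(line)["question"]
        # One SFT example per question: the student is later fine-tuned to map
        # the question to the teacher's full reasoning trace.
        record = {"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": generate_trace(question)},
        ]}
        fout.write(json.dumps(record) + "\n")
```

This covers only the teacher-sampling step; the paper's 1,000+ controlled experiments concern upstream pipeline choices (such as question sourcing and filtering) that this sketch leaves out.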

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

OpenThoughts releases open-source datasets and models for training reasoning models, achieving state-of-the-art results on AIME and code benchmarks.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Creates open-source datasets for training reasoning models, addressing the scarcity of public reasoning data
  • OpenThoughts2-1M dataset produces OpenThinker2-32B, matching DeepSeek-R1-Distill-32B performance
  • Systematically investigates the data generation pipeline with 1,000+ experiments, leading to OpenThoughts3-1.2M
  • OpenThinker3-7B achieves state-of-the-art: 53% AIME 2025, 51% LiveCodeBench, 54% GPQA Diamond
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Supervised fine-tuning
  • Data generation
  • Chain-of-thought reasoning
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • OpenThoughts2-1M
  • OpenThoughts3-1.2M
  • AIME 2025
  • LiveCodeBench
  • GPQA Diamond
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Did not explore datasets for reinforcement learning, a standard training regime for reasoning models
  • Did not explore staged SFT or curriculum learning to further improve performance
Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

The authors did not state explicit future directions.

Author keywords

  • Reasoning
  • Data
  • LLM
