Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling
Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, Yu Wang
We theoretically unveil the underlying limitations of length rewards and propose DECS to achieve superior efficiency without performance degradation
Abstract
While large reasoning models trained with critic-free reinforcement learning and verifiable rewards (RLVR) represent the state of the art, their practical utility is hampered by ``overthinking'', a critical issue where models generate excessively long reasoning paths without any performance benefit. Existing solutions that penalize length often fail, inducing performance degradation due to a fundamental misalignment between trajectory-level rewards and token-level optimization. In this work, we introduce a novel framework, DECS, built on our theoretical discovery of two previously unaddressed flaws in current length rewards: (1) the erroneous penalization of essential exploratory tokens and (2) the inadvertent rewarding of partial redundancy. Our framework's innovations include (i) a decoupled token-level reward mechanism that surgically distinguishes and penalizes redundant tokens, and (ii) a curriculum batch scheduling strategy to balance efficiency and efficacy. Experimental results show that DECS reduces reasoning tokens by over 50\% across seven benchmarks while maintaining or even improving performance. This demonstrates that substantial gains in reasoning efficiency can be achieved without compromising a model's underlying reasoning power. Code is available at \url{https://github.com/pixas/DECS}.
The DECS framework reduces overthinking in reasoning models by decoupling necessary from redundant tokens at the reward level and scheduling training batches via a curriculum.
- Decoupled token-level reward mechanism that distinguishes and penalizes redundant tokens while preserving exploratory reasoning
- Curriculum batch scheduling strategy to balance reasoning efficiency and efficacy
- Achieves over 50% reduction in reasoning tokens across seven benchmarks while maintaining or improving performance
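The decoupled reward and curriculum scheduling described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the per-token penalty scheme, and the redundancy-based ordering heuristic are all assumptions; the paper's actual NRP detector and scheduling criterion may differ.

```python
# Hypothetical sketch of a decoupled token-level reward: tokens flagged as
# redundant (e.g. by an auxiliary NRP detector) receive a length penalty,
# while exploratory/necessary tokens keep the trajectory-level outcome reward.

def decoupled_token_rewards(correct, redundant_mask, penalty=0.1):
    """Assign per-token rewards for a single trajectory.

    correct        -- bool, verifiable outcome of the trajectory
    redundant_mask -- list of bools, True where a token is judged redundant
    penalty        -- penalty subtracted only from redundant tokens
    """
    base = 1.0 if correct else 0.0
    rewards = []
    for is_redundant in redundant_mask:
        if is_redundant:
            rewards.append(base - penalty)  # discourage redundancy only
        else:
            rewards.append(base)            # preserve exploratory tokens
    return rewards


# Hypothetical curriculum ordering: present samples with the most
# compressible (most redundant) reasoning first, harder ones later.
def curriculum_order(samples, redundancy_score):
    return sorted(samples, key=redundancy_score, reverse=True)
```

The key property is that the outcome reward is never reduced on non-redundant tokens, which is how a token-level scheme can avoid penalizing essential exploration the way a uniform trajectory-length penalty does.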
Techniques
- Reinforcement learning with verifiable rewards (RLVR)
- Curriculum learning
- Token-level reward modeling
- Auxiliary neural network (NRP detector)
Benchmarks
- AIME2024
- AIME2025
- AMC23
- MATH500
Limitations
- NRP detector, implemented as a small auxiliary model, adds 5.1% training overhead
- Evaluation limited to models up to 7B parameters due to resource constraints
Future directions
- Integrating NRP detection directly into the policy via confidence or entropy signals instead of an auxiliary model
- Scaling the method to larger architectures with adequate compute
Author keywords
- efficient reasoning
- curriculum sampling with decoupled reward
Related orals
Mastering Sparse CUDA Generation through Pretrained Models and Deep Reinforcement Learning
SparseRL leverages deep RL and pretrained models to generate high-performance CUDA code for sparse matrix operations.
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
MemAgent uses RL-trained memory modules to enable LLMs to extrapolate from 8K to 3.5M token contexts with minimal performance degradation.
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
DiffusionNFT enables efficient online reinforcement learning for diffusion models via forward process optimization with up to 25x efficiency gains.
Hyperparameter Trajectory Inference with Conditional Lagrangian Optimal Transport
Hyperparameter Trajectory Inference uses conditional Lagrangian optimal transport to reconstruct neural network outputs across hyperparameter spectra without expensive retraining.
Q-RAG: Long Context Multi-Step Retrieval via Value-Based Embedder Training
Q-RAG fine-tunes embedders for multi-step retrieval using reinforcement learning, achieving state-of-the-art on long-context QA.