ICLR 2026 Orals

Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling

Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, Yu Wang

Reinforcement Learning & Agents · Thu, Apr 23 · 10:30–10:40 AM · Amphitheater · Avg rating: 6.50 (range 2–10)
Author-provided TL;DR

Theoretically unveils the underlying limitations of length rewards and proposes DECS to achieve superior efficiency without performance degradation.

Abstract

While large reasoning models trained with critic-free reinforcement learning and verifiable rewards (RLVR) represent the state-of-the-art, their practical utility is hampered by "overthinking", a critical issue where models generate excessively long reasoning paths without any performance benefit. Existing solutions that penalize length often fail, inducing performance degradation due to a fundamental misalignment between trajectory-level rewards and token-level optimization. In this work, we introduce a novel framework, DECS, built on our theoretical discovery of two previously unaddressed flaws in current length rewards: (1) the erroneous penalization of essential exploratory tokens and (2) the inadvertent rewarding of partial redundancy. Our framework's innovations include (i) a first-of-its-kind decoupled token-level reward mechanism that surgically distinguishes and penalizes redundant tokens, and (ii) a novel curriculum batch scheduling strategy to master the efficiency-efficacy equilibrium. Experimental results show DECS can achieve a dramatic reduction in reasoning tokens by over 50% across seven benchmarks while simultaneously maintaining or even improving performance. It demonstrates conclusively that substantial gains in reasoning efficiency can be achieved without compromising a model's underlying reasoning power. Code is available at https://github.com/pixas/DECS.
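The page does not spell out how the decoupled token-level reward works; as a rough illustration only, the hypothetical sketch below contrasts a trajectory-level length penalty (which scales the reward for every token uniformly) with a token-level penalty restricted to tokens flagged as redundant. The function names, the `redundant_mask` input, and the `alpha`/`beta` coefficients are all assumptions standing in for the paper's actual mechanism and NRP detector.

```python
# Hypothetical sketch (not the authors' implementation): trajectory-level
# length penalty vs. a decoupled token-level penalty that targets only
# tokens flagged as redundant. The redundancy mask stands in for the
# paper's NRP detector; alpha and beta are illustrative coefficients.

def trajectory_length_reward(correct: bool, n_tokens: int,
                             alpha: float = 0.001) -> float:
    """Trajectory-level reward: one scalar shared by all tokens, so
    exploratory and redundant tokens are penalized alike."""
    return (1.0 if correct else 0.0) - alpha * n_tokens

def decoupled_token_rewards(correct: bool, redundant_mask: list,
                            beta: float = 0.01) -> list:
    """Token-level rewards: the outcome reward is shared, but the
    length penalty applies only to tokens marked redundant."""
    outcome = 1.0 if correct else 0.0
    return [outcome - (beta if redundant else 0.0)
            for redundant in redundant_mask]

# Example: a correct 6-token trajectory whose last two tokens are redundant.
mask = [False, False, False, False, True, True]
print(trajectory_length_reward(True, len(mask)))  # ≈ 0.994 (all tokens share it)
print(decoupled_token_rewards(True, mask))        # redundant tokens get ≈ 0.99, the rest 1.0
```

Under the trajectory-level scheme, useful exploratory tokens absorb the same penalty as redundant ones, which the abstract identifies as the source of performance degradation; the decoupled variant leaves non-redundant tokens untouched.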

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

The DECS framework reduces overthinking in reasoning models by decoupling rewards for necessary versus redundant tokens and by scheduling training batches via a curriculum.

Contributions
  • Decoupled token-level reward mechanism that distinguishes and penalizes redundant tokens while preserving exploratory reasoning
  • Curriculum batch scheduling strategy to balance reasoning efficiency and efficacy
  • Achieves over 50% reduction in reasoning tokens across seven benchmarks while maintaining or improving performance
Methods used
  • Reinforcement learning with verifiable rewards (RLVR)
  • Curriculum learning
  • Token-level reward modeling
  • Auxiliary neural network (NRP detector)
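The curriculum learning listed above is described on this page only as "curriculum batch scheduling" for balancing efficiency and efficacy. As a generic illustration of the idea (an assumption — the page does not specify the paper's scheduling criterion), the sketch below orders training problems by an estimated difficulty score and widens the sampling pool from easy toward hard as training proceeds:

```python
import random

# Generic curriculum batch scheduling sketch (not the paper's algorithm):
# problems are sorted by a difficulty score (a hypothetical stand-in for
# whatever criterion DECS uses), and each training step samples a batch
# from a pool that grows linearly from the easiest problems to the full set.

def curriculum_batches(problems, difficulty, batch_size, n_steps):
    """Yield one batch per step, gradually widening the sampling pool."""
    ordered = sorted(problems, key=lambda p: difficulty[p])
    for step in range(n_steps):
        frac = (step + 1) / n_steps  # fraction of the dataset unlocked
        pool = ordered[:max(batch_size, int(frac * len(ordered)))]
        yield random.sample(pool, batch_size)

# Toy example: 10 problems with increasing difficulty scores.
difficulty = {f"q{i}": i / 10 for i in range(10)}
batches = list(curriculum_batches(list(difficulty), difficulty, 2, 5))
# Early batches draw only from the easiest problems; the last batch
# may draw from the entire dataset.
```

Many curriculum variants exist (e.g., unlocking by measured success rate rather than a fixed linear schedule); this linear pool growth is just the simplest form of the technique.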
Datasets used
  • AIME2024
  • AIME2025
  • AMC23
  • MATH500
Limitations (author-stated)
  • The NRP detector, implemented as a small auxiliary model, adds 5.1% training overhead
  • Evaluation is limited to models up to 7B parameters due to resource constraints
Future work (author-stated)
  • Integrating NRP detection directly into the policy via confidence or entropy signals instead of an auxiliary model
  • Scaling the method to larger architectures given adequate compute

Author keywords

  • efficient reasoning; curriculum sampling with decoupled reward
