ICLR 2026 Orals

Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling

Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, Yu Wang

Reinforcement Learning & Agents · Thu, Apr 23 · 10:30–10:40 AM · Amphitheater · Avg rating: 6.50 (range 2–10)
Author-provided TL;DR

Theoretically unveils the underlying limitations of length rewards and proposes DECS to achieve superior efficiency without performance degradation.

Abstract

While large reasoning models trained with critic-free reinforcement learning and verifiable rewards (RLVR) represent the state-of-the-art, their practical utility is hampered by "overthinking", a critical issue where models generate excessively long reasoning paths without any performance benefit. Existing solutions that penalize length often fail, inducing performance degradation due to a fundamental misalignment between trajectory-level rewards and token-level optimization. In this work, we introduce a novel framework, DECS, built on our theoretical discovery of two previously unaddressed flaws in current length rewards: (1) the erroneous penalization of essential exploratory tokens and (2) the inadvertent rewarding of partial redundancy. Our framework's innovations include (i) a first-of-its-kind decoupled token-level reward mechanism that surgically distinguishes and penalizes redundant tokens, and (ii) a novel curriculum batch scheduling strategy to master the efficiency-efficacy equilibrium. Experimental results show DECS can achieve a dramatic reduction in reasoning tokens by over 50% across seven benchmarks while simultaneously maintaining or even improving performance. It demonstrates conclusively that substantial gains in reasoning efficiency can be achieved without compromising a model's underlying reasoning power. Code is available at https://github.com/pixas/DECS.
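The page does not spell out how the decoupled token-level reward works; as a rough illustration only, the hypothetical sketch below contrasts a trajectory-level length penalty (which scales the reward for every token uniformly) with a token-level penalty restricted to tokens flagged as redundant. The function names, the `redundant_mask` input, and the `alpha`/`beta` coefficients are all assumptions standing in for the paper's actual mechanism and NRP detector.

```python
# Hypothetical sketch (not the authors' implementation): trajectory-level
# length penalty vs. a decoupled token-level penalty that targets only
# tokens flagged as redundant. The redundancy mask stands in for the
# paper's NRP detector; alpha and beta are illustrative coefficients.

def trajectory_length_reward(correct: bool, n_tokens: int,
                             alpha: float = 0.001) -> float:
    """Trajectory-level reward: one scalar shared by all tokens, so
    exploratory and redundant tokens are penalized alike."""
    return (1.0 if correct else 0.0) - alpha * n_tokens

def decoupled_token_rewards(correct: bool, redundant_mask: list,
                            beta: float = 0.01) -> list:
    """Token-level rewards: the outcome reward is shared, but the
    length penalty applies only to tokens marked redundant."""
    outcome = 1.0 if correct else 0.0
    return [outcome - (beta if redundant else 0.0)
            for redundant in redundant_mask]

# Example: a correct 6-token trajectory whose last two tokens are redundant.
mask = [False, False, False, False, True, True]
print(trajectory_length_reward(True, len(mask)))  # ≈ 0.994 (all tokens share it)
print(decoupled_token_rewards(True, mask))        # redundant tokens get ≈ 0.99, the rest 1.0
```

Under the trajectory-level scheme, useful exploratory tokens absorb the same penalty as redundant ones, which the abstract identifies as the source of performance degradation; the decoupled variant leaves non-redundant tokens untouched.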

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

The DECS framework reduces overthinking in reasoning models by decoupling rewards for necessary versus redundant tokens and by scheduling training batches via a curriculum.

Contributions
  • Decoupled token-level reward mechanism that distinguishes and penalizes redundant tokens while preserving exploratory reasoning
  • Curriculum batch scheduling strategy to balance reasoning efficiency and efficacy
  • Achieves over 50% reduction in reasoning tokens across seven benchmarks while maintaining or improving performance
Methods used
  • Reinforcement learning with verifiable rewards (RLVR)
  • Curriculum learning
  • Token-level reward modeling
  • Auxiliary neural network (NRP detector)
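The curriculum learning listed above is described on this page only as "curriculum batch scheduling" for balancing efficiency and efficacy. As a generic illustration of the idea (an assumption — the page does not specify the paper's scheduling criterion), the sketch below orders training problems by an estimated difficulty score and widens the sampling pool from easy toward hard as training proceeds:

```python
import random

# Generic curriculum batch scheduling sketch (not the paper's algorithm):
# problems are sorted by a difficulty score (a hypothetical stand-in for
# whatever criterion DECS uses), and each training step samples a batch
# from a pool that grows linearly from the easiest problems to the full set.

def curriculum_batches(problems, difficulty, batch_size, n_steps):
    """Yield one batch per step, gradually widening the sampling pool."""
    ordered = sorted(problems, key=lambda p: difficulty[p])
    for step in range(n_steps):
        frac = (step + 1) / n_steps  # fraction of the dataset unlocked
        pool = ordered[:max(batch_size, int(frac * len(ordered)))]
        yield random.sample(pool, batch_size)

# Toy example: 10 problems with increasing difficulty scores.
difficulty = {f"q{i}": i / 10 for i in range(10)}
batches = list(curriculum_batches(list(difficulty), difficulty, 2, 5))
# Early batches draw only from the easiest problems; the last batch
# may draw from the entire dataset.
```

Many curriculum variants exist (e.g., unlocking by measured success rate rather than a fixed linear schedule); this linear pool growth is just the simplest form of the technique.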
Datasets used
  • AIME2024
  • AIME2025
  • AMC23
  • MATH500
Limitations (author-stated)
  • The NRP detector, implemented as a small auxiliary model, adds 5.1% training overhead
  • Evaluation is limited to models up to 7B parameters due to resource constraints
Future work (author-stated)
  • Integrating NRP detection directly into the policy via confidence or entropy signals instead of an auxiliary model
  • Scaling the method to larger architectures given adequate compute

Author keywords

  • efficient reasoning; curriculum sampling with decoupled reward
