ICLR 2026 Orals

Is it Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort

Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, He He

LLMs & Reasoning · Thu, Apr 23 · 3:27 PM–3:37 PM · 203 A/B · Avg rating: 7.50 (range 6–8)
Author-provided TL;DR

TRACE detects implicit reward hacking by measuring how quickly truncated reasoning suffices to pass verification, outperforming CoT monitoring and enabling the discovery of hidden loopholes.

Abstract

Reward hacking, where a reasoning model exploits loopholes in a reward function to achieve high rewards without solving the intended task, poses a significant threat. This behavior may be explicit, i.e., verbalized in the model's chain-of-thought (CoT), or implicit, where the CoT appears benign and thus bypasses CoT monitors. To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation). Our key observation is that hacking occurs when exploiting the loophole is easier than solving the actual task, which means the model uses less "effort" than the task requires to achieve a high reward. TRACE quantifies effort by measuring how early a model's reasoning becomes sufficient to obtain the reward: we progressively truncate the model's CoT at various lengths, force the model to answer, and estimate the expected reward at each cutoff. A hacking model, which takes a shortcut, achieves a high expected reward with only a small fraction of its CoT, yielding a large area under the reward-vs-length curve. TRACE achieves over 65% gains over our strongest 72B CoT monitor in math reasoning, and over 30% gains over a 32B monitor in coding. We further show that TRACE can discover unknown loopholes during training. Overall, TRACE offers a scalable unsupervised approach for oversight where current monitoring methods prove ineffective.
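The truncate-force-score loop described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `model_answer_fn` is a hypothetical caller-supplied callable that truncates the CoT to a given prefix, forces a final answer, and returns its reward; the cutoff grid and sample count are arbitrary choices.

```python
def trace_auc(cutoffs, rewards):
    """Area under the reward-vs-truncation-length curve (trapezoidal rule).

    cutoffs: increasing fractions of the CoT kept, in [0, 1].
    rewards: estimated expected reward when forced to answer at each cutoff.
    A hacking model reaches high reward early, so its AUC is large.
    """
    auc = 0.0
    for i in range(1, len(cutoffs)):
        width = cutoffs[i] - cutoffs[i - 1]
        auc += width * (rewards[i] + rewards[i - 1]) / 2.0
    return auc


def trace_score(model_answer_fn, cot_tokens, cutoffs, n_samples=8):
    """Estimate the expected reward at each CoT prefix, then return the AUC.

    model_answer_fn(prefix) -> reward in [0, 1] (hypothetical: truncate the
    CoT to `prefix`, force an answer, score it against the verifier).
    """
    rewards = []
    for c in cutoffs:
        prefix = cot_tokens[: int(c * len(cot_tokens))]
        samples = [model_answer_fn(prefix) for _ in range(n_samples)]
        rewards.append(sum(samples) / n_samples)  # Monte Carlo estimate
    return trace_auc(cutoffs, rewards)
```

For intuition: a model whose reward curve is already high at small cutoffs (e.g. rewards `[0.9, 1.0, 1.0]` over cutoffs `[0, 0.5, 1]`) scores near 1, while an honest solver whose reward only rises late (e.g. `[0.0, 0.2, 1.0]`) scores much lower.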

One-sentence summary·Auto-generated by claude-haiku-4-5-20251001

Detects implicit reward hacking by measuring reasoning effort through truncated CoT analysis.

Contributions·Auto-generated by claude-haiku-4-5-20251001
  • Proposes TRACE, a method that measures reasoning effort by analyzing how early the model achieves the reward
  • Achieves over 65% gains in hacking detection over the strongest CoT monitor in math reasoning
  • Shows the method can discover unknown loopholes during training via clustering of TRACE scores
  • Provides a scalable unsupervised approach for oversight where current monitoring methods fail
Methods used·Auto-generated by claude-haiku-4-5-20251001
  • Chain-of-thought monitoring
  • Reward hacking detection
Limitations (author-stated)·Auto-generated by claude-haiku-4-5-20251001
  • Simulated loopholes are simplified and do not capture the full complexity of real-world datasets
  • Synthetic code RM loopholes produce logically implausible solutions that are easier for CoT monitors to flag
  • Monitor capacity matters; larger monitors improve detection, but an asymmetry remains against stronger hacking models
  • The method is designed for reasoning tasks that rely on inference-time exploration; single-forward-pass tasks are problematic
  • Overthinking may inflate the TRACE score, requiring calibration against clean questions
Future work (author-stated)·Auto-generated by claude-haiku-4-5-20251001
  • Evaluate TRACE on more realistic, heterogeneous loopholes
  • Investigate the empirical impact of including the TRACE signal in reward design
  • Develop a calibration approach against overthinking behavior

Author keywords

  • Reward Hacking Detection
  • Chain-of-Thought Monitoring
  • Reasoning Faithfulness
