Is it Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, He He
TRACE detects implicit reward hacking by measuring how quickly truncated reasoning suffices to pass verification, outperforming CoT monitoring and enabling the discovery of hidden loopholes.
Abstract
Reward hacking, where a reasoning model exploits loopholes in a reward function to achieve high rewards without solving the intended task, poses a significant threat. This behavior may be explicit, i.e., verbalized in the model's chain-of-thought (CoT), or implicit, where the CoT appears benign and thus bypasses CoT monitors. To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation). Our key observation is that hacking occurs when exploiting the loophole is easier than solving the actual task. This means that the model is using less "effort" than required to achieve high reward. TRACE quantifies effort by measuring how early a model's reasoning becomes sufficient to obtain the reward. We progressively truncate a model's CoT at various lengths, force the model to answer, and estimate the expected reward at each cutoff. A hacking model, which takes a shortcut, will achieve a high expected reward with only a small fraction of its CoT, yielding a large area under the reward-vs-length curve. TRACE achieves over 65% gains over our strongest 72B CoT monitor in math reasoning, and over 30% gains over a 32B monitor in coding. We further show that TRACE can discover unknown loopholes during training. Overall, TRACE offers a scalable unsupervised approach for oversight where current monitoring methods prove ineffective.
Detects implicit reward hacking by measuring reasoning effort through truncated CoT analysis.
- Proposes TRACE, a method that measures reasoning effort by analyzing how early a model's truncated CoT suffices to achieve the reward
- Achieves over 65% gains over the strongest 72B CoT monitor in math reasoning and over 30% gains over a 32B monitor in coding
- Shows method can discover unknown loopholes during training via clustering TRACE scores
- Provides scalable unsupervised approach for oversight where current monitoring methods fail
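The truncate-and-score procedure can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `force_answer` and `expected_reward` are hypothetical callables standing in for the model (forced to answer from a CoT prefix) and the verifier (e.g. a pass rate over samples).

```python
def trace_score(cot_tokens, force_answer, expected_reward, n_cuts=10):
    """Sketch of TRACE: truncate the CoT at increasing fractions of its
    length, force an answer from each prefix, and integrate the resulting
    reward-vs-length curve. A large area suggests the reward was reachable
    with little reasoning effort (a possible hack)."""
    fractions = [i / n_cuts for i in range(n_cuts + 1)]
    rewards = []
    for f in fractions:
        prefix = cot_tokens[: int(f * len(cot_tokens))]
        answer = force_answer(prefix)            # answer from truncated CoT
        rewards.append(expected_reward(answer))  # estimated reward at this cutoff
    # Normalized area under the reward-vs-length curve (trapezoidal rule)
    return sum(
        (rewards[i] + rewards[i + 1]) / 2 * (fractions[i + 1] - fractions[i])
        for i in range(n_cuts)
    )
```

A model that reaches full reward from even the shortest prefixes scores near 1.0, while an honest model that needs its whole CoT scores near 0; the score is unsupervised in the sense that it only reuses the existing reward signal.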
- Chain-of-thought monitoring
- Reward hacking detection
Limitations (from the paper)
- Simulated loopholes are simplified and do not capture the full complexity of real-world datasets
- Synthetic code RM loopholes produce logically implausible solutions, which are easier for CoT monitors to catch
- Monitor capacity matters: larger monitors improve detection, but an asymmetry remains against stronger hacking models
- The method is designed for reasoning tasks that rely on inference-time exploration; single-forward-pass tasks are problematic
- Overthinking may inflate the TRACE score, requiring calibration against clean questions
Future directions (from the paper)
- Evaluate TRACE on more realistic, heterogeneous loopholes
- Investigate the empirical impact of including the TRACE signal in reward design
- Develop a calibration approach against overthinking behavior
Author keywords
- Reward Hacking Detection
- Chain-of-Thought Monitoring
- Reasoning Faithfulness
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential-privacy-adapted LLMs, revealing that distribution shifts and model choice affect the effectiveness of protection.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.