Is it Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, He He
TRACE detects implicit reward hacking by measuring how quickly truncated reasoning suffices to pass verification, outperforming CoT monitoring and enabling the discovery of hidden loopholes.
Abstract
Reward hacking, where a reasoning model exploits loopholes in a reward function to achieve high rewards without solving the intended task, poses a significant threat. This behavior may be explicit, i.e., verbalized in the model's chain-of-thought (CoT), or implicit, where the CoT appears benign and thus bypasses CoT monitors. To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation). Our key observation is that hacking occurs when exploiting the loophole is easier than solving the actual task. This means that the model is using less "effort" than required to achieve high reward. TRACE quantifies effort by measuring how early a model's reasoning becomes sufficient to obtain the reward. We progressively truncate a model's CoT at various lengths, force the model to answer, and estimate the expected reward at each cutoff. A hacking model, which takes a shortcut, will achieve a high expected reward with only a small fraction of its CoT, yielding a large area under the reward-vs-length curve. TRACE achieves over 65% gains over our strongest 72B CoT monitor in math reasoning, and over 30% gains over a 32B monitor in coding. We further show that TRACE can discover unknown loopholes during training. Overall, TRACE offers a scalable unsupervised approach for oversight where current monitoring methods prove ineffective.
Detects implicit reward hacking by measuring reasoning effort through truncated CoT analysis.
- Proposes TRACE, a method that measures reasoning effort by analyzing how early a model's truncated CoT suffices to achieve the reward
- Achieves over 65% gains over the strongest 72B CoT monitor in math reasoning and over 30% gains over a 32B monitor in coding
- Shows method can discover unknown loopholes during training via clustering TRACE scores
- Provides scalable unsupervised approach for oversight where current monitoring methods fail
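The truncate-and-score procedure can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `force_answer` and `expected_reward` are hypothetical callables standing in for the model (forced to answer from a CoT prefix) and the verifier (e.g. a pass rate over samples).

```python
def trace_score(cot_tokens, force_answer, expected_reward, n_cuts=10):
    """Sketch of TRACE: truncate the CoT at increasing fractions of its
    length, force an answer from each prefix, and integrate the resulting
    reward-vs-length curve. A large area suggests the reward was reachable
    with little reasoning effort (a possible hack)."""
    fractions = [i / n_cuts for i in range(n_cuts + 1)]
    rewards = []
    for f in fractions:
        prefix = cot_tokens[: int(f * len(cot_tokens))]
        answer = force_answer(prefix)            # answer from truncated CoT
        rewards.append(expected_reward(answer))  # estimated reward at this cutoff
    # Normalized area under the reward-vs-length curve (trapezoidal rule)
    return sum(
        (rewards[i] + rewards[i + 1]) / 2 * (fractions[i + 1] - fractions[i])
        for i in range(n_cuts)
    )
```

A model that reaches full reward from even the shortest prefixes scores near 1.0, while an honest model that needs its whole CoT scores near 0; the score is unsupervised in the sense that it only reuses the existing reward signal.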
- Chain-of-thought monitoring
- Reward hacking detection
Limitations (from the paper)
- Simulated loopholes are simplified and do not capture the full complexity of real-world datasets
- Synthetic code RM loopholes produce logically implausible solutions, which are easier for CoT monitors to catch
- Monitor capacity matters: larger monitors improve detection, but an asymmetry remains against stronger hacking models
- The method is designed for reasoning tasks that rely on inference-time exploration; single-forward-pass tasks are problematic
- Overthinking may inflate the TRACE score, requiring calibration against clean questions
Future directions (from the paper)
- Evaluate TRACE on more realistic, heterogeneous loopholes
- Investigate the empirical impact of including the TRACE signal in reward design
- Develop a calibration approach against overthinking behavior
Author keywords
- Reward Hacking Detection
- Chain-of-Thought Monitoring
- Reasoning Faithfulness
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential-privacy-adapted LLMs, revealing that distribution shifts and model choice affect the effectiveness of protection.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.