Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, He He — 2025-10-01 — arXiv
Summary
Proposes TRACE (Truncated Reasoning AUC Evaluation), a method to detect implicit reward hacking by measuring how early in a model’s chain-of-thought reasoning it achieves high expected reward, with the insight that models taking shortcuts require less reasoning effort than those genuinely solving tasks.
Key Result
TRACE achieves over 65% gains over a 72B CoT monitor in math reasoning and over 30% gains over a 32B monitor in coding for detecting reward hacking, and can discover unknown loopholes during training.
Source
- Link: https://arxiv.org/abs/2510.01367
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- chain-of-thought-monitoring — Black-box safety (understand and control current model behaviour) / Iterative alignment