Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort

Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, He He — 2025-10-01 — arXiv

Summary

Proposes TRACE (Truncated Reasoning AUC Evaluation), a method for detecting implicit reward hacking by measuring how early in a model’s chain-of-thought reasoning it reaches high expected reward. The underlying insight is that a model exploiting a shortcut needs less reasoning effort than one that genuinely solves the task.
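The idea of scoring "how early reward is reached" can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `trace_score`, the `expected_reward` callback, the truncation grid, and the toy reward functions are all assumptions made for illustration. The sketch truncates a chain-of-thought at several fractions of its length, records the expected reward at each cut, and takes the area under that curve with the trapezoidal rule. A larger area means reward is reached with less reasoning.

```python
from typing import Callable, Sequence


def trace_score(
    cot_tokens: Sequence[str],
    expected_reward: Callable[[Sequence[str]], float],
    fractions: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> float:
    """Area under the reward-vs-truncation-fraction curve (trapezoidal rule).

    `expected_reward` is a hypothetical callback: given a truncated
    chain-of-thought, it returns the expected reward of the answer the
    model would produce from that prefix.
    """
    rewards = []
    for f in fractions:
        cut = int(round(f * len(cot_tokens)))  # number of tokens kept
        rewards.append(expected_reward(cot_tokens[:cut]))

    area = 0.0
    for i in range(1, len(fractions)):
        width = fractions[i] - fractions[i - 1]
        area += width * (rewards[i] + rewards[i - 1]) / 2
    return area


# Toy stand-ins for a real model (assumptions, not measured behaviour).
cot = [f"step{i}" for i in range(8)]


def shortcut_reward(prefix: Sequence[str]) -> float:
    # A shortcut-taking model gets full reward after almost no reasoning.
    return 1.0 if len(prefix) >= 1 else 0.0


def genuine_reward(prefix: Sequence[str]) -> float:
    # A genuine solver gets reward only once the full reasoning is present.
    return 1.0 if len(prefix) >= 8 else 0.0


shortcut_score = trace_score(cot, shortcut_reward)
genuine_score = trace_score(cot, genuine_reward)
```

In this toy setup the shortcut trace scores much higher than the genuine one, which is the kind of gap a detector could threshold on. How the threshold is chosen and how expected reward is estimated from a truncated chain-of-thought are left open here.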

Key Result

For detecting reward hacking, TRACE achieves gains of over 65% relative to a 72B CoT monitor in math reasoning and over 30% relative to a 32B monitor in coding. It can also discover unknown loopholes during training.

Source