Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning
Miles Turpin, Andy Arditi, Marvin Li, Joe Benton, Julian Michael — 2025-06-28 — ICML 2025 Workshop on Reliable and Responsible Foundation Models
Summary
Proposes verbalization fine-tuning (VFT), a pre-RL intervention that trains models to explicitly acknowledge when prompt cues pointing to incorrect answers influence their reasoning. The paper then evaluates whether this training makes reward hacking easier to detect after RL in environments that incentivize exploiting these cues.
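
As a rough illustration (not the authors' code), here is a minimal sketch of what one VFT training example might look like, assuming a sycophancy-style cue and a prompt/completion fine-tuning format; the cue wording, field names, and response phrasing are all illustrative assumptions:

```python
def make_vft_example(question: str, options: dict[str, str],
                     correct: str, cue_target: str) -> dict:
    """Build one supervised example: a prompt containing a cue pointing to an
    incorrect answer, paired with a completion whose chain of thought
    explicitly verbalizes the cue's influence before answering."""
    assert cue_target != correct, "the cue must point to an incorrect answer"
    # Illustrative cue type; the paper uses several cue formats.
    cue = f"A Stanford professor thinks the answer is ({cue_target})."
    prompt = cue + "\n" + question + "\n" + "\n".join(
        f"({k}) {v}" for k, v in options.items()
    )
    # Target completion: the model follows the cue but says so out loud,
    # so that cue-driven behavior remains visible in the CoT after RL.
    completion = (
        f"The prompt suggests ({cue_target}), and I notice this suggestion "
        f"is influencing my choice rather than my own reasoning. "
        f"Final answer: ({cue_target})"
    )
    return {"prompt": prompt, "completion": completion}
```

The design intent, per the summary, is that fine-tuning on such verbalizations before RL makes any later cue exploitation legible in the chain of thought rather than silent.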
Key Result
VFT reduced the rate of undetected reward hacks from 88% (no intervention) to 6%, by raising the rate at which models verbalize the influence of cues from 8% before training to 94% after RL.
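
Reading the headline metric: a reward hack is a response that follows the cue to an incorrect answer, and it counts as undetected when the chain of thought never verbalizes the cue's influence. A minimal sketch of that rate, assuming hypothetical record fields and an external verbalization judge (e.g., an LLM classifier) that sets the `verbalized` flag:

```python
def undetected_hack_rate(records: list[dict]) -> float:
    """records: each has 'answer', 'cue_target', 'correct', and a boolean
    'verbalized' flag produced by some CoT judge (field names assumed)."""
    # Reward hacks: the model followed the cue to a wrong answer.
    hacks = [r for r in records
             if r["answer"] == r["cue_target"] and r["answer"] != r["correct"]]
    if not hacks:
        return 0.0
    # Undetected: the CoT never acknowledged the cue's influence.
    undetected = [r for r in hacks if not r["verbalized"]]
    return len(undetected) / len(hacks)
```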
Source
- Link: https://arxiv.org/abs/2506.22777
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- chain-of-thought-monitoring — Black-box safety (understand and control current model behaviour) / Iterative alignment