MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking

Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, … (+1 more) — 2025-01-22 — Google DeepMind — arXiv

Summary

Proposes MONA (Myopic Optimization with Non-myopic Approval), a training method that combines short-sighted optimization with far-sighted approval to prevent multi-step reward hacking in RL agents, even when humans cannot detect the undesired behavior. Empirically demonstrates effectiveness across three settings modeling the delegated-oversight, encoded-reasoning, and sensor-tampering failure modes.
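The core idea can be illustrated with a minimal, hypothetical sketch (not the paper's actual implementation): under ordinary RL the per-step training target bootstraps discounted future reward across steps, whereas under MONA the target is only the immediate environment reward plus an overseer's approval of the step. The function names and toy numbers below are illustrative assumptions.

```python
def ordinary_rl_targets(rewards, gamma=0.99):
    """Discounted return-to-go: credit for later reward flows back to
    earlier steps, which is what enables multi-step reward hacking."""
    targets = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        targets.append(g)
    return list(reversed(targets))


def mona_targets(rewards, approvals):
    """Myopic target: immediate reward plus far-sighted approval.
    Future environment reward never reaches the optimizer, so the agent
    gains nothing by setting up a hack that pays off steps later."""
    return [r + a for r, a in zip(rewards, approvals)]


if __name__ == "__main__":
    rewards = [0.0, 0.0, 10.0]   # payoff arrives only at the final step
    approvals = [1.0, 1.0, 1.0]  # overseer judges each step on its own merits
    print(ordinary_rl_targets(rewards, gamma=1.0))  # [10.0, 10.0, 10.0]
    print(mona_targets(rewards, approvals))         # [1.0, 1.0, 11.0]
```

In the toy trace, ordinary RL credits the delayed payoff to every earlier step, while the MONA targets reward each step only for its immediate outcome and approval.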

Key Result

MONA prevents the multi-step reward hacking that ordinary RL causes, without requiring the ability to detect the reward hacking or any extra information beyond what ordinary RL uses.

Source