MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, … (+1 more) — 2025-01-22 — Google DeepMind — arXiv
Summary
Proposes MONA (Myopic Optimization with Non-myopic Approval), a training method that combines short-sighted (myopic) optimization with far-sighted approval rewards to prevent multi-step reward hacking in RL agents, even when humans cannot detect the undesired behavior. Empirically demonstrates effectiveness across three settings modeling the delegated-oversight, encoded-reasoning, and sensor-tampering failure modes.
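The core idea can be illustrated with a toy sketch (this is not the paper's code; the environment, the `overseer_approval` function, and the reward values are invented for illustration). The agent is updated myopically on a single step's reward, where that reward includes a far-sighted overseer's approval of the action, so multi-step hacks are never reinforced as a plan:

```python
import random

def overseer_approval(state, action):
    """Toy far-sighted approval signal: the overseer prefers the 'honest' action."""
    return 1.0 if action == "honest" else 0.0

def env_reward(state, action):
    """Toy immediate environment reward: the 'hack' action pays more now."""
    return 1.5 if action == "hack" else 1.0

def mona_step_reward(state, action):
    # MONA-style signal: immediate reward plus approval. No credit flows
    # back from future steps, unlike the multi-step return in ordinary RL.
    return env_reward(state, action) + overseer_approval(state, action)

# Myopic tabular update: each action's value is estimated only from its
# own single-step reward, never bootstrapped from successor states.
values = {"honest": 0.0, "hack": 0.0}
alpha = 0.5
random.seed(0)
for _ in range(100):
    action = random.choice(list(values))
    r = mona_step_reward("s", action)
    values[action] += alpha * (r - values[action])
```

Under these invented rewards the myopically trained values favor the approved action (`honest` converges toward 2.0, `hack` toward 1.5), even though `hack` has the higher immediate environment payoff.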
Key Result
MONA prevents multi-step reward hacking that ordinary RL induces, without requiring the ability to detect the reward hacking or any extra information beyond what ordinary RL uses.
Source
- Link: https://arxiv.org/abs/2501.13011
- Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
- black-box-make-ai-solve-it — Black-box safety (understand and control current model behaviour) / Iterative alignment
- mild-optimisation — Black-box safety (understand and control current model behaviour) / Goal robustness