MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, … (+1 more) — 2025-01-22 — Google DeepMind — arXiv
Summary
Proposes MONA (Myopic Optimization with Non-myopic Approval), a training method that combines short-sighted (myopic) optimization with far-sighted approval rewards to prevent multi-step reward hacking in RL agents, even when humans cannot detect the undesired behavior. Empirically demonstrates effectiveness across three settings modeling the delegated-oversight, encoded-reasoning, and sensor-tampering failure modes.
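The core idea can be illustrated with a toy sketch (this is not the paper's code; the environment, the `overseer_approval` function, and the reward values are invented for illustration). The agent is updated myopically on a single step's reward, where that reward includes a far-sighted overseer's approval of the action, so multi-step hacks are never reinforced as a plan:

```python
import random

def overseer_approval(state, action):
    """Toy far-sighted approval signal: the overseer prefers the 'honest' action."""
    return 1.0 if action == "honest" else 0.0

def env_reward(state, action):
    """Toy immediate environment reward: the 'hack' action pays more now."""
    return 1.5 if action == "hack" else 1.0

def mona_step_reward(state, action):
    # MONA-style signal: immediate reward plus approval. No credit flows
    # back from future steps, unlike the multi-step return in ordinary RL.
    return env_reward(state, action) + overseer_approval(state, action)

# Myopic tabular update: each action's value is estimated only from its
# own single-step reward, never bootstrapped from successor states.
values = {"honest": 0.0, "hack": 0.0}
alpha = 0.5
random.seed(0)
for _ in range(100):
    action = random.choice(list(values))
    r = mona_step_reward("s", action)
    values[action] += alpha * (r - values[action])
```

Under these invented rewards the myopically trained values favor the approved action (`honest` converges toward 2.0, `hack` toward 1.5), even though `hack` has the higher immediate environment payoff.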
Key Result
MONA prevents multi-step reward hacking that ordinary RL induces, without requiring the ability to detect the reward hacking or any extra information beyond what ordinary RL uses.
Source
- Link: https://arxiv.org/abs/2501.13011
- Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
- black-box-make-ai-solve-it — Black-box safety (understand and control current model behaviour) / Iterative alignment
- mild-optimisation — Black-box safety (understand and control current model behaviour) / Goal robustness