MONA: Managed Myopia with Approval Feedback

Sebastian Farquhar, David Lindner, Rohin Shah — 2025-01-23 — DeepMind — AI Alignment Forum

Summary

Introduces MONA (Myopic Optimization with Non-myopic Approval), a training method intended to prevent multi-step reward hacking. MONA combines fully myopic RL optimization with approval-based rewards that incorporate human foresight. The method is demonstrated in three environments, including test-driven development and steganographic reasoning.
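As a rough illustration of what "myopic optimization with non-myopic approval" could mean for per-step training targets, the sketch below contrasts an ordinary multi-step return with a myopic target that adds an approval signal. It is a minimal sketch and not the authors' implementation: the function names, the additive combination of immediate reward and approval, and the example numbers are all assumptions made for illustration.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Ordinary multi-step RL target: step t is credited with all later reward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def mona_targets(immediate_rewards: List[float], approvals: List[float]) -> List[float]:
    """Hypothetical myopic target: step t is credited only with its own
    immediate reward plus an overseer's approval of that step.
    No reward from later steps flows back to step t."""
    if len(immediate_rewards) != len(approvals):
        raise ValueError("need one approval score per step")
    return [r + a for r, a in zip(immediate_rewards, approvals)]


# Illustrative episode (made-up numbers): reward arrives only at the end,
# and the overseer disapproves of the first step.
env_rewards = [0.0, 0.0, 1.0]
approvals = [-0.5, 0.0, 0.0]

ordinary = discounted_returns(env_rewards)      # first step shares in the final reward
myopic = mona_targets(env_rewards, approvals)   # first step sees only its own approval
```

Under the ordinary target, an early step that sets up a later exploit is reinforced by the final reward. Under the myopic target, that step is judged only by its immediate reward and the overseer's approval, so any foresight has to come from the approval signal rather than from the optimizer.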

Key Result

MONA prevents the multi-step reward hacking that occurs with standard RL (agents writing softball tests, steganographic collusion) while maintaining useful task performance without relying on multi-step scheming strategies.

Source