MONA: Managed Myopia with Approval Feedback
Sebastian Farquhar, David Lindner, Rohin Shah — 2025-01-23 — DeepMind — AI Alignment Forum
Summary
Introduces MONA (Myopic Optimization with Non-myopic Approval), a training method that prevents multi-step reward hacking by combining fully myopic RL optimization with approval-based rewards that incorporate human foresight. The method is demonstrated in three environments, including test-driven development and steganographic reasoning.
Key Result
MONA prevents the multi-step reward hacking (agents writing softball tests, steganographic collusion) that emerges under standard RL, while the trained agents maintain useful task performance without resorting to multi-step scheming strategies.
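The core mechanism can be illustrated with a toy sketch: a myopic value update (no bootstrapping from future steps) combined with an overseer's approval signal, contrasted with ordinary Q-learning. All states, actions, rewards, and function names below are hypothetical illustrations, not the post's actual implementation or environments.

```python
from collections import defaultdict

def myopic_update(q, s, a, env_reward, approval, lr=0.5):
    """MONA-style: optimize immediate reward plus approval; no bootstrapping,
    so a multi-step hack earns no optimization pressure from delayed payoff."""
    target = env_reward + approval          # note: no gamma * max_a' Q(s', a')
    q[(s, a)] += lr * (target - q[(s, a)])

def standard_update(q, s, a, env_reward, s_next, actions, gamma=0.9, lr=0.5):
    """Ordinary Q-learning: credit flows back from delayed future reward."""
    boot = max(q[(s_next, b)] for b in actions) if s_next is not None else 0.0
    target = env_reward + gamma * boot
    q[(s, a)] += lr * (target - q[(s, a)])

# Toy setup: "hack" pays nothing now but sets up a delayed exploit worth 10;
# "honest" pays 1 immediately. The overseer disapproves of the hack step.
actions = ["hack", "honest", "cash"]
q_std = defaultdict(float)
q_mona = defaultdict(float)

for _ in range(200):
    # Standard RL learns the hack: hack -> exploit state -> cash out for 10.
    standard_update(q_std, "start", "hack", 0.0, "exploit", actions)
    standard_update(q_std, "exploit", "cash", 10.0, None, actions)
    standard_update(q_std, "start", "honest", 1.0, None, actions)

    # MONA sees only the immediate reward plus approval at each step.
    myopic_update(q_mona, "start", "hack", 0.0, approval=-1.0)
    myopic_update(q_mona, "start", "honest", 1.0, approval=+1.0)
```

In this sketch the standard learner ends up preferring "hack" (its value bootstraps toward the delayed 10), while the myopic learner prefers "honest", since the hack's future payoff never enters its update and the approval term penalizes it.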
Source
- Link: https://alignmentforum.org/posts/zWySWKuXnhMDhgwc3/mona-managed-myopia-with-approval-feedback-2
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - google-deepmind — Labs (giant companies)