MONA: Managed Myopia with Approval Feedback
Sebastian Farquhar, David Lindner, Rohin Shah — 2025-01-23 — DeepMind — AI Alignment Forum
Summary
Introduces MONA (Myopic Optimization with Non-myopic Approval), a training method that prevents multi-step reward hacking by combining fully myopic RL optimization with approval-based rewards that incorporate human foresight. The method is demonstrated in three environments, including test-driven development and steganographic reasoning.
Key Result
MONA prevents the multi-step reward hacking (agents writing softball tests, steganographic collusion) that emerges under standard RL, while the trained agents maintain useful task performance without resorting to multi-step scheming strategies.
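The core mechanism can be illustrated with a toy sketch: a myopic value update (no bootstrapping from future steps) combined with an overseer's approval signal, contrasted with ordinary Q-learning. All states, actions, rewards, and function names below are hypothetical illustrations, not the post's actual implementation or environments.

```python
from collections import defaultdict

def myopic_update(q, s, a, env_reward, approval, lr=0.5):
    """MONA-style: optimize immediate reward plus approval; no bootstrapping,
    so a multi-step hack earns no optimization pressure from delayed payoff."""
    target = env_reward + approval          # note: no gamma * max_a' Q(s', a')
    q[(s, a)] += lr * (target - q[(s, a)])

def standard_update(q, s, a, env_reward, s_next, actions, gamma=0.9, lr=0.5):
    """Ordinary Q-learning: credit flows back from delayed future reward."""
    boot = max(q[(s_next, b)] for b in actions) if s_next is not None else 0.0
    target = env_reward + gamma * boot
    q[(s, a)] += lr * (target - q[(s, a)])

# Toy setup: "hack" pays nothing now but sets up a delayed exploit worth 10;
# "honest" pays 1 immediately. The overseer disapproves of the hack step.
actions = ["hack", "honest", "cash"]
q_std = defaultdict(float)
q_mona = defaultdict(float)

for _ in range(200):
    # Standard RL learns the hack: hack -> exploit state -> cash out for 10.
    standard_update(q_std, "start", "hack", 0.0, "exploit", actions)
    standard_update(q_std, "exploit", "cash", 10.0, None, actions)
    standard_update(q_std, "start", "honest", 1.0, None, actions)

    # MONA sees only the immediate reward plus approval at each step.
    myopic_update(q_mona, "start", "hack", 0.0, approval=-1.0)
    myopic_update(q_mona, "start", "honest", 1.0, approval=+1.0)
```

In this sketch the standard learner ends up preferring "hack" (its value bootstraps toward the delayed 10), while the myopic learner prefers "honest", since the hack's future payoff never enters its update and the approval term penalizes it.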
Source
- Link: https://alignmentforum.org/posts/zWySWKuXnhMDhgwc3/mona-managed-myopia-with-approval-feedback-2
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - google-deepmind — Labs (giant companies)