Mitigating Goal Misgeneralization via Minimax Regret
Karim Abdel Sadek, Matthew Farrugia-Roberts, Usman Anwar, Hannah Erlebach, Christian Schroeder de Witt, David Krueger, … (+1 more) — 2025-07-03 — RLC 2025
Summary
Formalizes goal misgeneralization in RL and proves that minimax expected regret (MMER) objectives are robust to goal misgeneralization while maximum expected value (MEV) objectives are not, with empirical validation in grid-world environments.
Key Result
Training with a minimax expected regret objective provably avoids goal misgeneralization, and empirically yields greater robustness than standard domain randomization, which trains against an MEV objective.
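The two objectives can be contrasted in standard notation (a hedged sketch; the symbols below follow common unsupervised-environment-design conventions and are not necessarily the paper's exact formalism). Let $\Theta$ be a set of environments, $V_\theta(\pi)$ the expected return of policy $\pi$ in environment $\theta$, $\mathcal{D}$ a distribution over $\Theta$, and $\pi^*_\theta$ an optimal policy for $\theta$:

```latex
% MEV: maximize expected value over the training distribution.
\pi_{\mathrm{MEV}} \in \arg\max_{\pi} \; \mathbb{E}_{\theta \sim \mathcal{D}}\!\left[ V_\theta(\pi) \right]

% Regret of \pi in environment \theta, relative to an optimal policy for \theta.
\mathrm{Reg}_\theta(\pi) = V_\theta(\pi^*_\theta) - V_\theta(\pi)

% MMER: minimize the worst-case expected regret over environments.
\pi_{\mathrm{MMER}} \in \arg\min_{\pi} \; \max_{\theta \in \Theta} \; \mathrm{Reg}_\theta(\pi)
```

Intuitively, an MEV policy can score well by exploiting a proxy goal that happens to correlate with reward on high-probability training environments, whereas MMER penalizes large regret in any environment, including those that disambiguate the proxy from the intended goal.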
Source
- Link: https://arxiv.org/abs/2507.03068
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- rl-safety — Black-box safety (understand and control current model behaviour) / Goal robustness