Mitigating Goal Misgeneralization via Minimax Regret
Karim Abdel Sadek, Matthew Farrugia-Roberts, Usman Anwar, Hannah Erlebach, Christian Schroeder de Witt, David Krueger, … (+1 more) — 2025-07-03 — RLC 2025
Summary
Formalizes goal misgeneralization in RL and proves that minimax expected regret (MMER) objectives are robust to goal misgeneralization while maximum expected value (MEV) objectives are not, with empirical validation in grid-world environments.
Key Result
Training with a minimax expected regret objective provably avoids goal misgeneralization, and empirically yields greater robustness than standard domain randomization, which trains against an MEV objective.
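The two objectives can be contrasted in standard notation (a hedged sketch; the symbols below follow common unsupervised-environment-design conventions and are not necessarily the paper's exact formalism). Let $\Theta$ be a set of environments, $V_\theta(\pi)$ the expected return of policy $\pi$ in environment $\theta$, $\mathcal{D}$ a distribution over $\Theta$, and $\pi^*_\theta$ an optimal policy for $\theta$:

```latex
% MEV: maximize expected value over the training distribution.
\pi_{\mathrm{MEV}} \in \arg\max_{\pi} \; \mathbb{E}_{\theta \sim \mathcal{D}}\!\left[ V_\theta(\pi) \right]

% Regret of \pi in environment \theta, relative to an optimal policy for \theta.
\mathrm{Reg}_\theta(\pi) = V_\theta(\pi^*_\theta) - V_\theta(\pi)

% MMER: minimize the worst-case expected regret over environments.
\pi_{\mathrm{MMER}} \in \arg\min_{\pi} \; \max_{\theta \in \Theta} \; \mathrm{Reg}_\theta(\pi)
```

Intuitively, an MEV policy can score well by exploiting a proxy goal that happens to correlate with reward on high-probability training environments, whereas MMER penalizes large regret in any environment, including those that disambiguate the proxy from the intended goal.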
Source
- Link: https://arxiv.org/abs/2507.03068
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- rl-safety — Black-box safety (understand and control current model behaviour) / Goal robustness