The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret
Lukas Fluri, Leon Lang, Alessandro Abate, Patrick Forré, David Krueger, Joar Skalse — 2024-06-22 — arXiv
Summary
Proves mathematically that a learned reward model can achieve low expected test error on the data distribution and yet induce a policy with high regret under the true reward: optimizing the policy shifts the state-action distribution away from the training data, where the reward model's errors can dominate. This error-regret mismatch persists even with policy regularization techniques like those used in RLHF.
Key Result
For any fixed expected test error, there exist realistic data distributions under which error-regret mismatch (low test error but high worst-case regret) can occur, even when using policy regularization.
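To make the claim concrete, here is a minimal toy sketch (not taken from the paper; the bandit setup, numbers, and distribution are illustrative assumptions): a three-armed bandit in which the learned reward matches the true reward on the arms the data distribution covers well, but overestimates an arm that is rarely sampled. The expected test error is small, yet greedily optimizing the learned reward yields near-maximal regret.

```python
# Toy illustration of error-regret mismatch (illustrative only, not from the paper).
import numpy as np

true_reward    = np.array([0.9, 0.8, 0.0])   # true reward of each arm
learned_reward = np.array([0.9, 0.8, 1.0])   # learned model errs only on arm 2
data_dist      = np.array([0.5, 0.49, 0.01]) # arm 2 is almost never in the data

# Expected (absolute) test error of the learned reward under the data distribution.
test_error = np.sum(data_dist * np.abs(learned_reward - true_reward))  # = 0.01

# Greedy policies under each reward function.
arm_under_learned = int(np.argmax(learned_reward))  # arm 2
arm_under_true    = int(np.argmax(true_reward))     # arm 0

# Regret of the learned-reward-optimal policy, measured under the true reward.
regret = true_reward[arm_under_true] - true_reward[arm_under_learned]  # = 0.9

print(f"expected test error: {test_error:.3f}")  # low
print(f"regret:              {regret:.3f}")      # high
```

The sketch only illustrates why errors on rarely sampled state-actions can dominate regret; the paper's theorems generalize this to MDPs and show the phenomenon survives policy regularization.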
Source
- Link: https://arxiv.org/abs/2406.15753
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
  - rl-safety — Black-box safety (understand and control current model behaviour) / Goal robustness