The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret

Lukas Fluri, Leon Lang, Alessandro Abate, Patrick Forré, David Krueger, Joar Skalse — 2024-06-22 — arXiv

Summary

Mathematically proves that learned reward models can achieve low expected test error yet still induce policies with high regret, because optimizing a policy against the learned reward shifts the state-action distribution away from the data on which the reward model was evaluated. The paper shows that this error-regret mismatch persists even under policy regularization techniques like those used in RLHF.
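
As a rough sketch of the two quantities being contrasted (the notation below — err, Reg, D, J_R, π*, π_min — is chosen here for illustration and need not match the paper's exact definitions): the reward model R̂ is judged by its expected error under the data distribution D, while the policy it induces is judged by its normalized regret under the true reward R,

\[
  \mathrm{err}(\hat{R}) \;=\; \mathbb{E}_{(s,a)\sim D}\bigl[\,\lvert R(s,a) - \hat{R}(s,a)\rvert\,\bigr],
  \qquad
  \mathrm{Reg}(\pi) \;=\; \frac{J_R(\pi^{*}) - J_R(\pi)}{J_R(\pi^{*}) - J_R(\pi_{\min})},
\]

where $J_R(\pi)$ is the expected return of $\pi$ under the true reward, $\pi^{*}$ is an optimal policy, and $\pi_{\min}$ a worst-case policy. The mismatch can arise because err is evaluated under D, whereas optimizing against R̂ moves the policy's state-action distribution away from D, into regions where R̂ may be badly wrong.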

Key Result

For any fixed expected test error, there exist realistic data distributions under which error-regret mismatch can occur, even when policy regularization is used.
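
Loosely restated in the notation sketched above (a paraphrase, not the paper's exact theorem statement): for any error threshold $\varepsilon > 0$ and any regret level $L \in (0,1)$, there is a data distribution D and a reward model R̂ with expected error at most $\varepsilon$ such that some policy optimal for R̂ suffers regret at least L,

\[
  \forall\, \varepsilon > 0,\ \forall\, L \in (0,1):\quad
  \exists\, D,\ \exists\, \hat{R}\ \text{with}\ \mathrm{err}(\hat{R}) \le \varepsilon,\ \
  \exists\, \hat{\pi} \in \operatorname*{arg\,max}_{\pi} J_{\hat{R}}(\pi)\ \ \text{with}\ \ \mathrm{Reg}(\hat{\pi}) \ge L.
\]

The regularized version of the result replaces the unconstrained arg max with a regularized objective of the kind used in RLHF and shows that the same phenomenon can still occur.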

Source