The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret
Lukas Fluri, Leon Lang, Alessandro Abate, Patrick Forré, David Krueger, Joar Skalse — 2024-06-22 — arXiv
Summary
Proves mathematically that a learned reward model can achieve low expected test error on the data distribution and yet induce a policy with high regret under the true reward: optimizing the policy shifts the state-action distribution away from the training data, where the reward model's errors can dominate. This error-regret mismatch persists even with policy regularization techniques like those used in RLHF.
Key Result
For any fixed expected test error, there exist realistic data distributions under which error-regret mismatch (low test error but high worst-case regret) can occur, even when using policy regularization.
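To make the claim concrete, here is a minimal toy sketch (not taken from the paper; the bandit setup, numbers, and distribution are illustrative assumptions): a three-armed bandit in which the learned reward matches the true reward on the arms the data distribution covers well, but overestimates an arm that is rarely sampled. The expected test error is small, yet greedily optimizing the learned reward yields near-maximal regret.

```python
# Toy illustration of error-regret mismatch (illustrative only, not from the paper).
import numpy as np

true_reward    = np.array([0.9, 0.8, 0.0])   # true reward of each arm
learned_reward = np.array([0.9, 0.8, 1.0])   # learned model errs only on arm 2
data_dist      = np.array([0.5, 0.49, 0.01]) # arm 2 is almost never in the data

# Expected (absolute) test error of the learned reward under the data distribution.
test_error = np.sum(data_dist * np.abs(learned_reward - true_reward))  # = 0.01

# Greedy policies under each reward function.
arm_under_learned = int(np.argmax(learned_reward))  # arm 2
arm_under_true    = int(np.argmax(true_reward))     # arm 0

# Regret of the learned-reward-optimal policy, measured under the true reward.
regret = true_reward[arm_under_true] - true_reward[arm_under_learned]  # = 0.9

print(f"expected test error: {test_error:.3f}")  # low
print(f"regret:              {regret:.3f}")      # high
```

The sketch only illustrates why errors on rarely sampled state-actions can dominate regret; the paper's theorems generalize this to MDPs and show the phenomenon survives policy regularization.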
Source
- Link: https://arxiv.org/abs/2406.15753
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
  - rl-safety — Black-box safety (understand and control current model behaviour) / Goal robustness