Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree?
Xueru Wen, Jie Lou, Yaojie Lu, Hongyu Lin, Xing Yu, Xinyu Lu, … (+4 more) — 2024-10-08 — arXiv (Accepted at ICLR 2025 Spotlight)
Summary
Empirically investigates whether reward model (RM) accuracy predicts downstream policy performance in RLHF, finding only a weak correlation and substantial variation in policy outcomes among RMs with similar accuracy scores.
Key Result
Policies optimized against reward models with similar accuracy can exhibit markedly different performance, and accuracy fails to fully capture a reward model's susceptibility to overoptimization, which the paper analyzes through the lens of the Regressional Goodhart effect.
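As a rough intuition for the Regressional Goodhart effect invoked here, consider a toy best-of-n simulation (a hypothetical sketch, not the paper's experimental setup): the proxy reward is the true reward plus independent noise, and as optimization pressure (n) grows, the proxy score of the selected candidate inflates faster than its true reward.

```python
import random
import statistics

def best_of_n(n, noise_sd, trials=2000):
    """Toy Regressional Goodhart demo: select the candidate with the
    highest PROXY score (true reward + Gaussian noise) and report the
    mean proxy and mean true reward of the selected candidates."""
    rng = random.Random(0)  # fixed seed for a reproducible sketch
    sel_proxy, sel_true = [], []
    for _ in range(trials):
        # True rewards of n candidate responses, drawn from N(0, 1).
        true_rewards = [rng.gauss(0, 1) for _ in range(n)]
        # Proxy (reward-model) score = true reward + independent noise.
        scored = [(u + rng.gauss(0, noise_sd), u) for u in true_rewards]
        proxy, true = max(scored)  # pick argmax under the proxy
        sel_proxy.append(proxy)
        sel_true.append(true)
    return statistics.mean(sel_proxy), statistics.mean(sel_true)

for n in (4, 64):
    proxy, true = best_of_n(n, noise_sd=1.0)
    print(f"n={n:3d}  mean proxy={proxy:.2f}  mean true={true:.2f}")
```

Under these assumptions the proxy/true gap widens as n increases: the selected candidate increasingly owes its top ranking to favorable noise rather than genuine quality, mirroring how two RMs of equal measured accuracy can still induce very different policy outcomes under optimization pressure.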
Source
- Link: https://arxiv.org/abs/2410.05584
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- rl-safety — Black-box safety (understand and control current model behaviour) / Goal robustness