Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree?

Xueru Wen, Jie Lou, Yaojie Lu, Hongyu Lin, Xing Yu, Xinyu Lu, … (+4 more) — 2024-10-08 — arXiv (Accepted at ICLR 2025 Spotlight)

Summary

Empirically investigates whether reward model accuracy predicts downstream policy performance in RLHF, finding only a weak correlation and substantial variation in policy outcomes among reward models with similar accuracy scores.

Key Result

Policies optimized against reward models with similar accuracy can exhibit markedly different performance, and accuracy fails to fully capture the potential for reward model overoptimization, as analyzed through the lens of the Regressional Goodhart effect.
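The Regressional Goodhart effect can be illustrated with a toy simulation (this sketch is not from the paper; the Gaussian reward and noise model is an assumption for illustration). If the proxy reward equals the true reward plus independent noise, then selecting the best-of-n candidate by proxy score increasingly selects for noise, so the gap between proxy and true reward grows with optimization pressure n:

```python
import random
import statistics

random.seed(0)

def best_of_n_gap(n, trials=2000):
    """Average (proxy - true) reward of the proxy-argmax among n candidates.

    Each candidate has true reward ~ N(0,1) and independent noise ~ N(0,1);
    the proxy score is their sum. The returned gap is the noise term of the
    selected candidate, which grows with n (Regressional Goodhart).
    """
    gaps = []
    for _ in range(trials):
        cands = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
        true_r, noise = max(cands, key=lambda c: c[0] + c[1])  # pick by proxy
        gaps.append(noise)  # proxy minus true = noise of the selected candidate
    return statistics.mean(gaps)

for n in (1, 4, 16, 64):
    print(f"n={n:3d}  mean proxy-true gap: {best_of_n_gap(n):.2f}")
```

At n=1 the expected gap is zero, but as n grows the argmax over the proxy systematically favors candidates whose score is inflated by noise, so the proxy overstates the true reward — the mechanism by which an apparently accurate reward model can still be overoptimized.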

Source