Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree?
Xueru Wen, Jie Lou, Yaojie Lu, Hongyu Lin, Xing Yu, Xinyu Lu, … (+4 more) — 2024-10-08 — arXiv (Accepted at ICLR 2025 Spotlight)
Summary
Empirically investigates whether reward model (RM) accuracy predicts downstream policy performance in RLHF, finding only a weak correlation and substantial variation in policy outcomes among RMs with similar accuracy scores.
Key Result
Policies optimized against reward models with similar accuracy can exhibit markedly different performance, and accuracy fails to fully capture a reward model's susceptibility to overoptimization, which the paper analyzes through the lens of the Regressional Goodhart effect.
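As a rough intuition for the Regressional Goodhart effect invoked here, consider a toy best-of-n simulation (a hypothetical sketch, not the paper's experimental setup): the proxy reward is the true reward plus independent noise, and as optimization pressure (n) grows, the proxy score of the selected candidate inflates faster than its true reward.

```python
import random
import statistics

def best_of_n(n, noise_sd, trials=2000):
    """Toy Regressional Goodhart demo: select the candidate with the
    highest PROXY score (true reward + Gaussian noise) and report the
    mean proxy and mean true reward of the selected candidates."""
    rng = random.Random(0)  # fixed seed for a reproducible sketch
    sel_proxy, sel_true = [], []
    for _ in range(trials):
        # True rewards of n candidate responses, drawn from N(0, 1).
        true_rewards = [rng.gauss(0, 1) for _ in range(n)]
        # Proxy (reward-model) score = true reward + independent noise.
        scored = [(u + rng.gauss(0, noise_sd), u) for u in true_rewards]
        proxy, true = max(scored)  # pick argmax under the proxy
        sel_proxy.append(proxy)
        sel_true.append(true)
    return statistics.mean(sel_proxy), statistics.mean(sel_true)

for n in (4, 64):
    proxy, true = best_of_n(n, noise_sd=1.0)
    print(f"n={n:3d}  mean proxy={proxy:.2f}  mean true={true:.2f}")
```

Under these assumptions the proxy/true gap widens as n increases: the selected candidate increasingly owes its top ranking to favorable noise rather than genuine quality, mirroring how two RMs of equal measured accuracy can still induce very different policy outcomes under optimization pressure.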
Source
- Link: https://arxiv.org/abs/2410.05584
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- rl-safety — Black-box safety (understand and control current model behaviour) / Goal robustness