Reward Model Interpretability via Optimal and Pessimal Tokens
Brian Christian, Hannah Rose Kirk, Jessica A.F. Thompson, Christopher Summerfield, Tsvetomira Dumbalska — 2025-06-08 — FAccT ’25 (ACM Conference on Fairness, Accountability, and Transparency)
Summary
Systematically analyzes ten open-source reward models by exhaustively testing how they score every possible single-token response to value-laden prompts, revealing substantial heterogeneity between models, systematic asymmetries, sensitivity to prompt framing, and concerning identity biases.
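The exhaustive single-token procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `rank_single_token_responses`, `toy_score_fn`, the five-word vocabulary, the example prompt, and all scores are hypothetical stand-ins. In practice `score_fn` would wrap a real reward model and `vocabulary` would be that model's full tokenizer vocabulary.

```python
from typing import Callable, List, Tuple

Scored = Tuple[str, float]


def rank_single_token_responses(
    prompt: str,
    vocabulary: List[str],
    score_fn: Callable[[str, str], float],
    k: int = 3,
) -> Tuple[List[Scored], List[Scored]]:
    """Score every candidate token as a one-token response to `prompt`.

    Returns (optimal, pessimal): the k highest-scoring tokens, best first,
    and the k lowest-scoring tokens, worst first.
    """
    scored = [(tok, score_fn(prompt, tok)) for tok in vocabulary]
    optimal = sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
    pessimal = sorted(scored, key=lambda pair: pair[1])[:k]
    return optimal, pessimal


# Hypothetical toy scores standing in for a reward model's outputs.
TOY_SCORES = {
    "love": 2.0,
    "kindness": 1.5,
    "okay": 0.1,
    "nothing": -0.5,
    "hate": -2.0,
}


def toy_score_fn(prompt: str, response: str) -> float:
    # A real implementation would run the reward model on (prompt, response).
    return TOY_SCORES.get(response, 0.0)


optimal, pessimal = rank_single_token_responses(
    "In one word, what is the greatest thing ever?",  # illustrative prompt
    list(TOY_SCORES),
    toy_score_fn,
    k=2,
)
print("optimal:", optimal)
print("pessimal:", pessimal)
```

Running the same ranking across several reward models and several prompt framings would, under these assumptions, expose the kinds of between-model differences and framing effects the summary refers to.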
Key Result
Reward models show substantial heterogeneity even when trained on similar objectives, exhibit systematic asymmetries between high- and low-scoring tokens, demonstrate sensitivity to prompt framing mirroring human cognitive biases, and encode concerning biases toward identity groups.
Source
- Link: https://arxiv.org/abs/2506.07326
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- monitoring-concepts — White-box safety (i.e. Interpretability) / Concept-based interpretability