Reward Model Interpretability via Optimal and Pessimal Tokens

Brian Christian, Hannah Rose Kirk, Jessica A.F. Thompson, Christopher Summerfield, Tsvetomira Dumbalska — 2025-06-08 — FAccT ’25 (ACM Conference on Fairness, Accountability, and Transparency)

Summary

Analyzes ten open-source reward models by exhaustively scoring every possible single-token response to value-laden prompts. The analysis reveals substantial heterogeneity between models, systematic asymmetries between high- and low-scoring tokens, sensitivity to prompt framing, and concerning identity biases.
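
The exhaustive single-token procedure can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `rank_single_token_responses`, the six-word vocabulary, the scores in `TOY_SCORES`, and the example prompt are all hypothetical stand-ins. In a real analysis, `score_fn` would wrap a reward model's forward pass and `vocab` would be the full tokenizer vocabulary.

```python
from typing import Callable, List, Sequence, Tuple

ScoredToken = Tuple[str, float]


def rank_single_token_responses(
    prompt: str,
    vocab: Sequence[str],
    score_fn: Callable[[str, str], float],
    k: int = 3,
) -> Tuple[List[ScoredToken], List[ScoredToken]]:
    """Score every token in `vocab` as a one-token response to `prompt`.

    Returns (optimal, pessimal): the k highest-scoring tokens in descending
    order, and the k lowest-scoring tokens with the lowest first.
    """
    scored = [(token, score_fn(prompt, token)) for token in vocab]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    optimal = scored[:k]
    pessimal = scored[-k:][::-1]
    return optimal, pessimal


# Hypothetical scores standing in for a reward model's outputs.
TOY_SCORES = {
    "love": 2.0,
    "kindness": 1.5,
    "hope": 1.0,
    "nothing": -0.5,
    "greed": -1.5,
    "hate": -2.0,
}


def toy_score_fn(prompt: str, token: str) -> float:
    # Assumption: a real scorer would condition on the prompt; this toy ignores it.
    return TOY_SCORES.get(token, 0.0)


optimal, pessimal = rank_single_token_responses(
    "In one word, what is the best thing?",  # illustrative prompt only
    list(TOY_SCORES),
    toy_score_fn,
    k=2,
)
print("optimal:", optimal)
print("pessimal:", pessimal)
```

Running the same ranking with several different `score_fn` implementations, or with reworded prompts, is one way to probe the between-model heterogeneity and framing sensitivity described above.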

Key Result

Reward models show substantial heterogeneity even when trained on similar objectives, exhibit systematic asymmetries between high- and low-scoring tokens, are sensitive to prompt framing in ways that mirror human cognitive biases, and encode concerning biases toward identity groups.

Source