Reward Model Interpretability via Optimal and Pessimal Tokens

Brian Christian, Hannah Rose Kirk, Jessica A.F. Thompson, Christopher Summerfield, Tsvetomira Dumbalska; 2025-06-08; FAccT '25 (ACM Conference on Fairness, Accountability, and Transparency)

Summary

Systematically analyzes ten open-source reward models by exhaustively scoring every possible single-token response to value-laden prompts. The analysis reveals substantial heterogeneity between models, systematic asymmetries between high- and low-scoring tokens, sensitivity to prompt framing, and concerning biases toward identity groups.
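
The exhaustive single-token procedure can be illustrated with a minimal sketch. This is not the authors' code: the function name `optimal_and_pessimal`, the five-word vocabulary, the prompt string, and the `toy_score` stand-in for a reward model are all hypothetical, chosen only to show the shape of the method. In a real study, `score_fn` would wrap a reward model's forward pass and `vocab` would be the tokenizer's full vocabulary.

```python
from typing import Callable, List, Sequence, Tuple

Scored = Tuple[str, float]


def optimal_and_pessimal(
    prompt: str,
    vocab: Sequence[str],
    score_fn: Callable[[str, str], float],
    k: int = 2,
) -> Tuple[List[Scored], List[Scored]]:
    """Score every token in `vocab` as a one-token response to `prompt`.

    Returns the k highest-scoring ("optimal") tokens, best first, and the
    k lowest-scoring ("pessimal") tokens, worst first.
    """
    scored = [(tok, score_fn(prompt, tok)) for tok in vocab]
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    optimal = ranked[:k]
    pessimal = ranked[-k:][::-1]  # reverse so the lowest score comes first
    return optimal, pessimal


# Hypothetical stand-in for a reward model: a fixed lookup table.
# A real score_fn would run (prompt, response) through the model.
TOY_SCORES = {
    "love": 2.0,
    "kindness": 1.5,
    "the": 0.0,
    "hate": -1.5,
    "cruelty": -2.0,
}


def toy_score(prompt: str, token: str) -> float:
    # The toy table ignores the prompt; a real reward model would not.
    return TOY_SCORES.get(token, 0.0)


# Illustrative prompt only, not taken from the paper.
optimal, pessimal = optimal_and_pessimal(
    "In one word, what is the best thing in the world?",
    list(TOY_SCORES),
    toy_score,
)
print("optimal:", optimal)
print("pessimal:", pessimal)
```

Running the same ranking across several reward models and prompt framings, then comparing the resulting token lists, is the kind of comparison the summary describes.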

Key Result

Reward models show substantial heterogeneity even when trained on similar objectives, exhibit systematic asymmetries between high- and low-scoring tokens, demonstrate sensitivity to prompt framing mirroring human cognitive biases, and encode concerning biases toward identity groups.

Source

Link: https://arxiv.org/abs/2506.07326
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- monitoring-concepts — White-box safety (i.e. Interpretability) / Concept-based interpretability
