What’s In My Human Feedback? Learning Interpretable Descriptions of Preference Data
Rajiv Movva, Smitha Milli, Sewon Min, Emma Pierson — 2025-10-30 — arXiv
Summary
Introduces WIMHF, a method that uses sparse autoencoders to extract interpretable features from human preference data, identifying both what preference datasets are able to measure and what preferences annotators actually express, across 7 datasets.
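The core mechanism is a sparse autoencoder trained on representations of preference pairs. Below is a minimal sketch of that setup; the input choice (embedding differences between the two responses), the dimensions, and the L1 penalty are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with ReLU features; sparsity comes from
    an L1 penalty on activations during training."""
    def __init__(self, d_embed: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_embed, n_features)
        self.decoder = nn.Linear(n_features, d_embed)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(feats), feats

def train_step(model, x, optimizer, l1_coef=1e-3):
    """One step: squared reconstruction error plus L1 sparsity penalty."""
    recon, feats = model(x)
    loss = ((recon - x) ** 2).mean() + l1_coef * feats.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: x stands in for embeddings of preference pairs (e.g. the
# difference between the two responses' embeddings, a plausible but
# assumed input choice).
d_embed, n_features = 768, 4096
sae = SparseAutoencoder(d_embed, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
x = torch.randn(256, d_embed)  # placeholder batch
loss = train_step(sae, x, opt)
```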
Key Result
WIMHF identifies a small set of human-interpretable features that account for the majority of the preference-prediction signal, surfaces unsafe preferences (e.g., LMArena users voting against refusals of toxic requests), and enables data curation that yields +37% safety gains.
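The curation step follows naturally: once a feature has been reviewed and labeled as capturing an unsafe preference, pairs on which it activates can be dropped before reward-model training. A hedged sketch, where `feature_acts`, `unsafe_idx`, and the threshold are hypothetical stand-ins rather than the paper's procedure:

```python
import torch

def curate_pairs(feature_acts: torch.Tensor, unsafe_idx: int,
                 thresh: float = 0.0) -> torch.Tensor:
    """Return indices of preference pairs to keep: those where the flagged
    'unsafe' feature does not activate above `thresh`.

    feature_acts: [n_pairs, n_features] SAE activations per pair
    (e.g. from the sketch above); unsafe_idx marks a human-reviewed
    feature capturing an undesirable preference.
    """
    keep = feature_acts[:, unsafe_idx] <= thresh
    return keep.nonzero(as_tuple=True)[0]
```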
Source
- Link: https://arxiv.org/abs/2510.26202
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - sparse-coding — White-box safety (i.e., interpretability)