What’s In My Human Feedback? Learning Interpretable Descriptions of Preference Data

Rajiv Movva, Smitha Milli, Sewon Min, Emma Pierson — 2025-10-30 — arXiv

Summary

Introduces WIMHF, a method that uses sparse autoencoders to extract interpretable features from human preference data, identifying both what preference datasets are able to measure and what annotators actually express, evaluated across 7 datasets.
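A minimal sketch of the core idea, assuming (hypothetically) that each preference pair is represented by the embedding difference between the chosen and rejected response, over which a sparse autoencoder learns a dictionary of sparse, potentially interpretable directions. This is an illustrative toy, not the authors' implementation; all shapes, hyperparameters, and the plain-NumPy training loop are assumptions.

```python
import numpy as np

# Toy sparse autoencoder over preference-pair representations.
# Hypothetical setup: X holds embedding differences (chosen - rejected);
# the SAE reconstructs X through sparse ReLU codes z.
rng = np.random.default_rng(0)
d_embed, d_latent, n = 32, 64, 256
X = rng.normal(size=(n, d_embed))  # stand-in for real embedding differences

W_enc = rng.normal(scale=0.1, size=(d_embed, d_latent))
b_enc = np.zeros(d_latent)
W_dec = rng.normal(scale=0.1, size=(d_latent, d_embed))

def recon_loss(X, W_enc, b_enc, W_dec):
    z = np.maximum(X @ W_enc + b_enc, 0.0)
    return 0.5 * np.mean(np.sum((z @ W_dec - X) ** 2, axis=1))

loss_start = recon_loss(X, W_enc, b_enc, W_dec)

lr, l1 = 1e-2, 1e-3  # learning rate and L1 sparsity weight (assumed values)
for _ in range(200):
    z = np.maximum(X @ W_enc + b_enc, 0.0)  # sparse ReLU codes
    err = (z @ W_dec) - X                   # reconstruction error
    # Gradients of 0.5*||err||^2/n + l1*|z| w.r.t. the parameters
    grad_Wdec = z.T @ err / n
    dz = (err @ W_dec.T + l1 * np.sign(z)) * (z > 0)
    grad_Wenc = X.T @ dz / n
    grad_benc = dz.mean(axis=0)
    W_dec -= lr * grad_Wdec
    W_enc -= lr * grad_Wenc
    b_enc -= lr * grad_benc

loss_end = recon_loss(X, W_enc, b_enc, W_dec)
z = np.maximum(X @ W_enc + b_enc, 0.0)
sparsity = (z > 0).mean()  # fraction of active latents per example
```

In the paper's framing, each learned latent direction would then be inspected (e.g., via the examples that most activate it) to produce a human-readable description of a preference feature; the toy above only shows the dictionary-learning step.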

Key Result

WIMHF identifies a small number of human-interpretable features that account for the majority of the preference-prediction signal, surfaces unsafe preferences (e.g., LMArena users voting against refusals to produce toxic content), and enables data curation yielding a +37% gain in safety.

Source