What’s In My Human Feedback? Learning Interpretable Descriptions of Preference Data
Rajiv Movva, Smitha Milli, Sewon Min, Emma Pierson — 2025-10-30 — arXiv
Summary
Introduces WIMHF, a method that uses sparse autoencoders to extract interpretable features from human preference data, identifying both what preference datasets are able to measure and what preferences annotators actually express, across 7 datasets.
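The core mechanism is a sparse autoencoder trained on representations of preference pairs. Below is a minimal sketch of that setup; the input choice (embedding differences between the two responses), the dimensions, and the L1 penalty are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with ReLU features; sparsity comes from
    an L1 penalty on activations during training."""
    def __init__(self, d_embed: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_embed, n_features)
        self.decoder = nn.Linear(n_features, d_embed)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(feats), feats

def train_step(model, x, optimizer, l1_coef=1e-3):
    """One step: squared reconstruction error plus L1 sparsity penalty."""
    recon, feats = model(x)
    loss = ((recon - x) ** 2).mean() + l1_coef * feats.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: x stands in for embeddings of preference pairs (e.g. the
# difference between the two responses' embeddings, a plausible but
# assumed input choice).
d_embed, n_features = 768, 4096
sae = SparseAutoencoder(d_embed, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
x = torch.randn(256, d_embed)  # placeholder batch
loss = train_step(sae, x, opt)
```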
Key Result
WIMHF identifies a small set of human-interpretable features that account for the majority of the preference-prediction signal, surfaces unsafe preferences (e.g., LMArena users voting against refusals of toxic requests), and enables data curation that yields +37% safety gains.
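The curation step follows naturally: once a feature has been reviewed and labeled as capturing an unsafe preference, pairs on which it activates can be dropped before reward-model training. A hedged sketch, where `feature_acts`, `unsafe_idx`, and the threshold are hypothetical stand-ins rather than the paper's procedure:

```python
import torch

def curate_pairs(feature_acts: torch.Tensor, unsafe_idx: int,
                 thresh: float = 0.0) -> torch.Tensor:
    """Return indices of preference pairs to keep: those where the flagged
    'unsafe' feature does not activate above `thresh`.

    feature_acts: [n_pairs, n_features] SAE activations per pair
    (e.g. from the sketch above); unsafe_idx marks a human-reviewed
    feature capturing an undesirable preference.
    """
    keep = feature_acts[:, unsafe_idx] <= thresh
    return keep.nonzero(as_tuple=True)[0]
```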
Source
- Link: https://arxiv.org/abs/2510.26202
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - sparse-coding — White-box safety (i.e., interpretability)