Maximizing Signal in Human-Model Preference Alignment
Kelsey Kraus, Margaret Kroll — 2025-03-06 — arXiv
Summary
Proposes methodological best practices for disentangling signal from noise in human preference labeling tasks. A case study applies the improved human-judgment methods to evaluate two guardrails classifiers, aligning model behavior with user preferences.
Key Result
Demonstrates that label noise arising from annotator disagreement can be minimized through proven methodological best practices, while maximizing the signal available for model training and evaluation in guardrails classification.
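Illustration
As a minimal sketch of the noise-filtering idea, assuming majority-vote aggregation with a per-item agreement threshold (illustrative assumptions, not confirmed as the paper's exact procedure), one might retain only high-agreement items as training signal:

```python
# Hypothetical sketch: separating signal from noise in multi-annotator
# preference labels by per-item agreement filtering. The 0.8 threshold and
# majority-vote aggregation are illustrative assumptions, not the paper's
# documented method.
from collections import Counter

def filter_by_agreement(labels_per_item, min_agreement=0.8):
    """Keep items whose annotators mostly agree; return (item_id, majority_label).

    labels_per_item: dict mapping item id -> list of annotator labels.
    min_agreement: minimum fraction of annotators on the majority label.
    """
    kept = []
    for item_id, labels in labels_per_item.items():
        majority_label, count = Counter(labels).most_common(1)[0]
        agreement = count / len(labels)
        if agreement >= min_agreement:  # high-agreement items carry the signal
            kept.append((item_id, majority_label))
    return kept

# Example: three annotators judge whether a guardrails classifier's
# decision matches user preference ("safe" vs "unsafe").
annotations = {
    "ex1": ["safe", "safe", "safe"],      # unanimous -> kept
    "ex2": ["safe", "unsafe", "unsafe"],  # 2/3 agreement -> dropped at 0.8
    "ex3": ["unsafe", "unsafe", "unsafe"],
}
print(filter_by_agreement(annotations))  # [('ex1', 'safe'), ('ex3', 'unsafe')]
```

The threshold trades coverage against label quality: raising it discards more contested items but leaves cleaner signal for training and evaluation.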
Source
- Link: https://arxiv.org/abs/2503.04910
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- data-quality-for-alignment — Black-box safety (understand and control current model behaviour) / Better data