Maximizing Signal in Human-Model Preference Alignment

Kelsey Kraus, Margaret Kroll — 2025-03-06 — arXiv

Summary

Proposes methodological best practices for separating signal from noise in human preference-labeling tasks. A case study applies the improved human-judgment methods to evaluate two guardrails classifiers, aligning model behavior with user preferences.

Key Result

Demonstrates that noise from annotator disagreement in labeling can be minimized through established methodological best practices, maximizing the signal available for model training and evaluation in guardrails classification.
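The paper does not specify a single agreement statistic here, but a common way to quantify annotator disagreement in labeling tasks of this kind is a chance-corrected agreement measure such as Cohen's kappa. The sketch below is illustrative only; the label names and data are invented:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from each annotator's
    marginal label distribution.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the two annotators' label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical guardrails labels from two annotators.
ann_1 = ["safe", "safe", "unsafe", "safe", "unsafe", "safe"]
ann_2 = ["safe", "unsafe", "unsafe", "safe", "unsafe", "safe"]
print(round(cohens_kappa(ann_1, ann_2), 3))  # → 0.667
```

Low kappa flags label sets where apparent "noise" may instead reflect ambiguous guidelines or genuinely divided preferences, which is the kind of disagreement the paper's methods aim to disentangle.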

Source