Iterative Label Refinement Matters More than Preference Optimization under Weak Supervision
Yaowen Ye, Cassidy Laidlaw, Jacob Steinhardt — 2025-01-14 — UC Berkeley — arXiv
Summary
Proposes Iterative Label Refinement (ILR) as an alternative to RLHF for aligning language models under unreliable supervision. Instead of training the model directly on comparison feedback, ILR uses the feedback to decide whether to replace each demonstration label with a model-generated alternative, then retrains via SFT on the refined dataset; it demonstrates superior performance over DPO on math, coding, and safe instruction-following tasks.
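A minimal sketch of the loop described above may help. The helper names (`sft_train`, `generate`, `prefer`) and the round count are placeholders, not the paper's API, and details such as how candidate labels are sampled are elided:

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (prompt, label) pair from the unreliable demonstrations

def ilr(
    dataset: List[Example],
    sft_train: Callable[[List[Example]], object],  # fine-tune a model on (prompt, label) pairs
    generate: Callable[[object, str], str],        # sample a completion from the model
    prefer: Callable[[str, str, str], bool],       # unreliable comparison: does candidate beat incumbent?
    rounds: int = 3,
) -> object:
    """Iterative Label Refinement: spend comparison feedback on improving the
    SFT labels, then retrain with plain SFT, rather than preference-optimizing
    the model directly as DPO/RLHF would."""
    model = sft_train(dataset)  # initial SFT on the unreliable demonstrations
    for _ in range(rounds):
        refined: List[Example] = []
        for prompt, label in dataset:
            candidate = generate(model, prompt)  # model proposes a replacement label
            # Keep whichever label the (noisy) comparer prefers.
            refined.append((prompt, candidate if prefer(prompt, candidate, label) else label))
        dataset = refined
        model = sft_train(dataset)  # re-run SFT on the refined dataset
    return model
```

The contrast with DPO is that the comparison signal only edits the dataset; the model itself is always trained with plain SFT.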
Key Result
SFT+ILR outperforms SFT+DPO when supervision is unreliable, indicating that comparison feedback is better spent improving the training data than on continued training of the model itself.
Source
- Link: https://arxiv.org/abs/2501.07886
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - iterative-alignment-at-post-train-time — Black-box safety (understand and control current model behaviour) / Iterative alignment