Iterative Label Refinement Matters More than Preference Optimization under Weak Supervision
Yaowen Ye, Cassidy Laidlaw, Jacob Steinhardt — 2025-01-14 — UC Berkeley — arXiv
Summary
Proposes Iterative Label Refinement (ILR) as an alternative to RLHF for aligning language models under unreliable supervision. Instead of training the model directly on comparison feedback, ILR uses the feedback to decide whether to replace each demonstration label with a model-generated alternative, then retrains via SFT on the refined dataset; it demonstrates superior performance over DPO on math, coding, and safe instruction-following tasks.
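A minimal sketch of the loop described above may help. The helper names (`sft_train`, `generate`, `prefer`) and the round count are placeholders, not the paper's API, and details such as how candidate labels are sampled are elided:

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (prompt, label) pair from the unreliable demonstrations

def ilr(
    dataset: List[Example],
    sft_train: Callable[[List[Example]], object],  # fine-tune a model on (prompt, label) pairs
    generate: Callable[[object, str], str],        # sample a completion from the model
    prefer: Callable[[str, str, str], bool],       # unreliable comparison: does candidate beat incumbent?
    rounds: int = 3,
) -> object:
    """Iterative Label Refinement: spend comparison feedback on improving the
    SFT labels, then retrain with plain SFT, rather than preference-optimizing
    the model directly as DPO/RLHF would."""
    model = sft_train(dataset)  # initial SFT on the unreliable demonstrations
    for _ in range(rounds):
        refined: List[Example] = []
        for prompt, label in dataset:
            candidate = generate(model, prompt)  # model proposes a replacement label
            # Keep whichever label the (noisy) comparer prefers.
            refined.append((prompt, candidate if prefer(prompt, candidate, label) else label))
        dataset = refined
        model = sft_train(dataset)  # re-run SFT on the refined dataset
    return model
```

The contrast with DPO is that the comparison signal only edits the dataset; the model itself is always trained with plain SFT.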
Key Result
SFT+ILR outperforms SFT+DPO when supervision is unreliable, indicating that comparison feedback is better spent improving the training data than on continued training of the model itself.
Source
- Link: https://arxiv.org/abs/2501.07886
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - iterative-alignment-at-post-train-time — Black-box safety (understand and control current model behaviour) / Iterative alignment