Debate Helps Weak-to-Strong Generalization
Hao Lang, Fei Huang, Yongbin Li — 2025-01-21 — arXiv
Summary
Empirically tests whether debate can improve weak-to-strong generalization. Strong models first assist weak models during training; the enhanced weak models are then used to supervise the strong models. Experiments use OpenAI’s weak-to-strong NLP benchmarks.
Key Result
Debate helps weak models extract trustworthy information from strong models and use it as supervision. Ensembles of weak models exploit the debate arguments more effectively, which improves alignment on the weak-to-strong benchmarks.
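As an illustration of the ensemble step, here is a minimal sketch of how several weak models might turn debate arguments into a single training label for the strong model. It is not the paper's implementation. The function names, the dictionary keys (`question`, `debate_arguments`), the choice of averaging, and the use of soft labels are all assumptions made for illustration, and each weak model is represented by a hypothetical callable that returns a positive-class probability.

```python
from statistics import mean


def ensemble_soft_label(weak_probs):
    """Average positive-class probabilities from several weak models.

    Averaging is an assumed aggregation rule, chosen for simplicity.
    """
    if not weak_probs:
        raise ValueError("need at least one weak-model prediction")
    return mean(weak_probs)


def build_weak_labels(examples, weak_models):
    """Return (question, soft_label) pairs for training a strong student.

    Each example is a dict with hypothetical keys:
      "question": the task input
      "debate_arguments": list of argument strings from the strong debaters
    Each weak model is a hypothetical callable: str -> probability in [0, 1].
    """
    labeled = []
    for ex in examples:
        # The weak models read the question together with the debate arguments.
        context = ex["question"] + "\n" + "\n".join(ex["debate_arguments"])
        probs = [model(context) for model in weak_models]
        labeled.append((ex["question"], ensemble_soft_label(probs)))
    return labeled


if __name__ == "__main__":
    # Stand-in weak models; real ones would be small fine-tuned classifiers.
    toy_models = [lambda ctx: 0.25, lambda ctx: 0.75]
    toy_examples = [
        {"question": "Is the claim supported?",
         "debate_arguments": ["Debater A: yes, because ...",
                              "Debater B: no, because ..."]},
    ]
    for question, label in build_weak_labels(toy_examples, toy_models):
        print(question, label)
```

In this sketch the strong model would then be fine-tuned on the returned pairs. Whether labels are kept soft or thresholded to hard labels is a design choice left open here.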
Source
- Link: https://arxiv.org/abs/2501.13124
- Listed in the Shallow Review of Technical AI Safety 2025 under 2 agendas:
- black-box-make-ai-solve-it — Black-box safety (understand and control current model behaviour) / Iterative alignment
- weak-to-strong-generalization — Make AI solve it