Debate Helps Weak-to-Strong Generalization

Hao Lang, Fei Huang, Yongbin Li — 2025-01-21 — arXiv

Summary

Empirically tests whether debate can improve weak-to-strong generalization. Strong models assist weak models during training, and the enhanced weak models are then used to supervise the strong models on OpenAI’s weak-to-strong NLP benchmarks.

Key Result

Debate enables weak models to extract trustworthy information from strong models for supervision, and ensembles of weak models better exploit debate arguments, leading to improved alignment on weak-to-strong benchmarks.
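
As a rough illustration of the ensemble idea in the Key Result, the sketch below averages the predictions of several weak supervisors that each read a debate transcript, producing soft labels a strong model could be trained on. It is not the authors' implementation: the names WeakSupervisor, ensemble_soft_label, and build_weak_labels are hypothetical, the supervisors are stand-in functions rather than trained models, and plain averaging is an assumed aggregation rule.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: a weak supervisor maps (question, debate transcript)
# to its estimated probability that the positive label is correct.
WeakSupervisor = Callable[[str, str], float]


def ensemble_soft_label(
    question: str,
    transcript: str,
    supervisors: Sequence[WeakSupervisor],
) -> float:
    """Average the weak supervisors' probabilities into one soft label."""
    if not supervisors:
        raise ValueError("need at least one weak supervisor")
    return mean(s(question, transcript) for s in supervisors)


def build_weak_labels(
    examples: Sequence[Tuple[str, str]],
    supervisors: Sequence[WeakSupervisor],
) -> List[Tuple[str, float]]:
    """Return (question, soft_label) pairs for training a strong student."""
    return [(q, ensemble_soft_label(q, t, supervisors)) for q, t in examples]


# Toy stand-ins for trained weak models (assumption: fixed outputs).
toy_supervisors = [lambda q, t: 0.5, lambda q, t: 1.0]
toy_labels = build_weak_labels(
    [("Is the claim true?", "debater A: ... debater B: ...")],
    toy_supervisors,
)
```

Averaging is only one way to combine the ensemble; a vote or a learned weighting would fit the same interface.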

Source

Link: https://arxiv.org/abs/2501.13124
Listed in the Shallow Review of Technical AI Safety 2025 under 2 agendas:
- black-box-make-ai-solve-it — Black-box safety (understand and control current model behaviour) / Iterative alignment
- weak-to-strong-generalization — Make AI solve it
