Weak to Strong Generalization for Large Language Models with Multi-capabilities
Yucheng Zhou, Jianbing Shen, Yu Cheng — 2025-01-22 — ICLR 2025
Summary
Extends weak-to-strong generalization to multi-capability settings. The authors find that different capabilities remain relatively independent during generalization, and they propose a two-stage training framework that uses reward models to select valuable weak supervision data while preventing the performance degradation caused by strong-model self-bootstrapping.
Key Result
Different capabilities remain relatively independent during weak-to-strong generalization, and the proposed reward-model-based data selection framework significantly improves multi-capability weak-to-strong generalization performance over baseline methods.
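Illustration
The reward-model-based data selection described above can be pictured with a minimal sketch. This is not the authors' implementation: the `WeakExample` record, the `select_by_reward` helper, the `keep_fraction` parameter, and the choice to rank within each capability are all assumptions made for illustration, and `reward_fn` stands in for whatever reward model scores a weak-labeled example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class WeakExample:
    """One weak-supervision example (hypothetical record layout)."""
    capability: str  # assumed capability tag, e.g. "math" or "code"
    prompt: str
    weak_label: str  # label produced by the weak model


def select_by_reward(
    examples: List[WeakExample],
    reward_fn: Callable[[WeakExample], float],
    keep_fraction: float = 0.5,
) -> List[WeakExample]:
    """Keep the highest-scoring fraction of examples within each capability.

    Ranking per capability (rather than globally) is an assumption of this
    sketch; it keeps one capability's scores from crowding out another's.
    """
    # Group examples by their capability tag, preserving input order.
    by_capability: Dict[str, List[WeakExample]] = {}
    for example in examples:
        by_capability.setdefault(example.capability, []).append(example)

    selected: List[WeakExample] = []
    for group in by_capability.values():
        # Highest reward first; keep at least one example per capability.
        ranked = sorted(group, key=reward_fn, reverse=True)
        keep = max(1, int(len(ranked) * keep_fraction))
        selected.extend(ranked[:keep])
    return selected
```

In a two-stage setup of the kind the summary mentions, a subset chosen this way could feed the second training stage; how the paper actually schedules the stages and scores the data is not specified here.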
Source
- Link: https://openreview.net/forum?id=N1vYivuSKq
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- black-box-make-ai-solve-it — Black-box safety (understand and control current model behaviour) / Iterative alignment