Weak to Strong Generalization for Large Language Models with Multi-capabilities

Yucheng Zhou, Jianbing Shen, Yu Cheng — 2025-01-22 — ICLR 2025

Summary

Extends weak-to-strong generalization to multi-capability settings. The paper finds that capabilities remain relatively independent of one another, and it proposes a two-stage training framework that uses reward models to select valuable weak supervision data. The framework is designed to prevent the performance degradation that arises when the strong model bootstraps on its own outputs.
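
As a rough illustration of reward-based data selection, the sketch below keeps the highest-scoring weakly labeled examples within each capability. It is not the authors' implementation. The names `WeakExample`, `select_weak_data`, `reward_fn`, and `keep_fraction` are hypothetical, and both the per-capability grouping and the top-fraction rule are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class WeakExample:
    prompt: str
    weak_label: str   # answer produced by the weak model
    capability: str   # hypothetical capability tag, e.g. "math"


def select_weak_data(
    examples: List[WeakExample],
    reward_fn: Callable[[WeakExample], float],
    keep_fraction: float = 0.5,
) -> List[WeakExample]:
    """Keep the top-scoring fraction of examples within each capability.

    reward_fn stands in for a trained reward model; here it is any
    callable that maps an example to a score (an assumption).
    """
    # Group examples by capability so each one is filtered separately.
    by_capability: Dict[str, List[WeakExample]] = {}
    for ex in examples:
        by_capability.setdefault(ex.capability, []).append(ex)

    selected: List[WeakExample] = []
    for group in by_capability.values():
        ranked = sorted(group, key=reward_fn, reverse=True)
        n_keep = max(1, int(len(ranked) * keep_fraction))
        selected.extend(ranked[:n_keep])
    return selected


# Toy usage: label length is a placeholder score, not a real reward model.
examples = [
    WeakExample("q1", "a", "math"),
    WeakExample("q2", "abcd", "math"),
    WeakExample("q3", "xy", "reading"),
    WeakExample("q4", "x", "reading"),
]
kept = select_weak_data(examples, lambda ex: len(ex.weak_label), 0.5)
```

The selected subset would then serve as supervision for the strong model in a later training stage. How the reward models are trained and how the two stages are arranged is specified in the paper, not here.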

Key Result

Different capabilities remain relatively independent during weak-to-strong generalization, and the proposed reward model-based data selection framework significantly improves multi-capability weak-to-strong generalization performance compared to baseline methods.

Source

Link: https://openreview.net/forum?id=N1vYivuSKq
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- black-box-make-ai-solve-it — Black-box safety (understand and control current model behaviour) / Iterative alignment