Scaling Laws For Scalable Oversight
Joshua Engels, David D. Baek, Subhash Kantamneni, Max Tegmark — 2025-04-25 — MIT — arXiv (NeurIPS 2025 Spotlight)
Summary
Proposes and validates a framework that quantifies the probability of successful oversight as a function of overseer and overseen capabilities, modeling oversight as games between capability-mismatched players. Tests the framework on Nim and on four oversight games (Mafia, Debate, Backdoor Code, and Wargames), then derives optimal conditions for Nested Scalable Oversight.
Key Result
At a general Elo gap of 400, Nested Scalable Oversight success rates are 13.5% for Mafia, 51.7% for Debate, 10.0% for Backdoor Code, and 9.4% for Wargames; these rates decline further when overseeing stronger systems.
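For intuition about what a 400-point Elo gap means, the sketch below evaluates the standard Elo expected-score formula. This is an illustrative assumption, not the paper's fitted model: the rates above are reported at a gap in general Elo, which need not equal the gap in game-specific Elo, so they should not be expected to fall out of this formula directly. The function name `elo_win_probability` is hypothetical.

```python
def elo_win_probability(elo_a: float, elo_b: float) -> float:
    """Standard Elo expected score of player A against player B.

    Assumption: a single shared Elo scale for both players. The paper's
    oversight games may use separate, game-specific ratings per role.
    """
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


if __name__ == "__main__":
    # Overseer rated 400 points below the system it oversees.
    for gap in (0, 100, 200, 400):
        p = elo_win_probability(1000.0, 1000.0 + gap)
        print(f"gap={gap:>3}  weaker-player win probability={p:.3f}")
```

Under the standard formula, a 400-point deficit gives the weaker player a win probability of 1/11, roughly 9.1%.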
Source
- Link: https://arxiv.org/abs/2504.18530
- Listed in the Shallow Review of Technical AI Safety 2025 under 2 agendas:
  - weak-to-strong-generalization — Make AI solve it
  - supervising-ais-improving-ais — Make AI solve it
- Editorial blurb (verbatim):
[Scaling Laws for Scalable Oversight](https://arxiv.org/abs/2504.18530)