Scaling Laws For Scalable Oversight

Joshua Engels, David D. Baek, Subhash Kantamneni, Max Tegmark — 2025-04-25 — MIT — arXiv (NeurIPS 2025 Spotlight)

Summary

Proposes and validates a framework that quantifies the probability of successful oversight as a function of overseer and overseen capabilities, modeling oversight as games between capability-mismatched players. Tests the framework on Nim, Mafia, Debate, Backdoor Code, and Wargames, then derives the conditions under which Nested Scalable Oversight is optimal.

Key Result

At a general Elo gap of 400, Nested Scalable Oversight success rates are 13.5% for Mafia, 51.7% for Debate, 10.0% for Backdoor Code, and 9.4% for Wargames; these rates decline further when overseeing stronger systems.
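
For a sense of scale, the textbook Elo expected-score formula gives the win probability of the lower-rated player at a 400-point gap. This is only a minimal sketch: it assumes the standard logistic Elo curve applied to a single game, and the function name and ratings are hypothetical. The rates reported above are stated against a general Elo gap, which need not coincide with a gap in any one game's own Elo, so they are not expected to match this baseline.

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Hypothetical ratings, chosen only to produce a 400-point gap.
overseer_elo = 1000.0
overseen_elo = 1400.0

p_overseer = elo_win_probability(overseer_elo, overseen_elo)
print(f"{p_overseer:.4f}")  # prints 0.0909, i.e. 1/11
```

Under this assumption a 400-point deficit leaves the weaker player with roughly a 9% chance of winning, which is a useful reference point when reading the per-game figures.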

Source