Scaling Laws For Scalable Oversight

Joshua Engels, David D. Baek, Subhash Kantamneni, Max Tegmark — 2025-04-25 — MIT — arXiv (NeurIPS 2025 Spotlight)

Summary

Proposes and validates a framework that quantifies the probability of successful oversight as a function of overseer and overseen capabilities, modeling oversight as games between capability-mismatched players. Tests the framework on Nim, Mafia, Debate, Backdoor Code, and Wargames, then derives optimal conditions for Nested Scalable Oversight.

Key Result

At a general Elo gap of 400, Nested Scalable Oversight success rates are 13.5% for Mafia, 51.7% for Debate, 10.0% for Backdoor Code, and 9.4% for Wargames; these rates decline further when overseeing stronger systems.
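
For intuition about what an Elo gap of 400 means, the sketch below evaluates the standard Elo expected-score formula for a single game between two players. It is an illustration only: the function name and the ratings are hypothetical, and it does not reproduce the paper's fitted curves or the Nested Scalable Oversight rates quoted above, which are stated in terms of a general Elo gap and depend on the specific game.

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B.

    This is the textbook Elo formula, used here as an assumption;
    it is not claimed to be the paper's exact model.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Hypothetical ratings: the overseer sits 400 Elo points below the overseen system.
overseer_elo = 1000.0
overseen_elo = 1400.0

p_overseer = elo_win_probability(overseer_elo, overseen_elo)
print(f"Overseer win probability at a 400-point deficit: {p_overseer:.4f}")
```

Under this formula a 400-point deficit corresponds to odds of 1 to 10 against the weaker player, so the expected score is 1/11. The per-game rates reported above differ from this figure and from each other.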

Source

Link: https://arxiv.org/abs/2504.18530
Listed in the Shallow Review of Technical AI Safety 2025 under 2 agendas:
- weak-to-strong-generalization — Make AI solve it
- supervising-ais-improving-ais — Make AI solve it
Editorial blurb (verbatim): [Scaling Laws for Scalable Oversight](https://arxiv.org/abs/2504.18530)
