Prover-Estimator Debate: A New Scalable Oversight Protocol
Jonah Brown-Cohen, Geoffrey Irving — 2025-06-17 — UK AISI — LessWrong / AI Alignment Forum
Summary
Presents a new debate protocol in which a prover decomposes a claim into subclaims and an estimator assigns probabilities to them. The authors give formal proofs that honest behavior is incentivized at equilibrium even when the debaters have comparable compute, which addresses the obfuscated arguments problem in scalable oversight.
Key Result
Proves that, under a stability assumption, the protocol incentivizes honest behavior at equilibrium without requiring the honest debater to use exponentially more compute, which prior debate protocols could not guarantee.
Source
- Link: https://lesswrong.com/posts/8XHBaugB5S3r27MG9/prover-estimator-debate-a-new-scalable-oversight-protocol
- Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
- black-box-make-ai-solve-it — Black-box safety (understand and control current model behaviour) / Iterative alignment
- debate — Make AI solve it