Neural Interactive Proofs
Lewis Hammond, Sam Adam-Day — 2024-12-08 — University of Oxford — ICLR 2025
Summary
Introduces neural interactive proofs: a game-theoretic framework in which a trusted weak model (the verifier) interacts with an untrusted strong model (the prover) to solve tasks beyond the weak model's own capabilities. The paper proposes several new protocols (NIP, MNIP, and zero-knowledge variants) and evaluates them empirically on graph isomorphism and code validation tasks.
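For intuition, the graph isomorphism task echoes the classic interactive proof for graph non-isomorphism: the verifier sends a random relabelling of one of two graphs, and a capable prover can always identify the source graph exactly when the graphs are non-isomorphic. This is a minimal self-contained sketch of that classic protocol, not the paper's neural protocol; all function names here are illustrative.

```python
import itertools
import random

def permute(graph, perm):
    # Relabel the vertices of an edge-set graph under a permutation.
    return {frozenset((perm[u], perm[v])) for u, v in (tuple(e) for e in graph)}

def isomorphic(g0, g1, n):
    # Brute-force isomorphism test; fine for the tiny graphs used here.
    return any(permute(g0, p) == g1 for p in itertools.permutations(range(n)))

def play_round(g0, g1, n):
    # Verifier: secretly pick one graph and send a random relabelling of it.
    b = random.randrange(2)
    perm = list(range(n))
    random.shuffle(perm)
    challenge = permute([g0, g1][b], perm)
    # Prover (computationally unbounded): report which graph the challenge
    # is isomorphic to. If g0 and g1 are non-isomorphic this always succeeds;
    # if they are isomorphic, the prover can do no better than a coin flip.
    guess = 0 if isomorphic(challenge, g0, n) else 1
    return guess == b
```

With a path graph versus a star graph on four vertices (which are non-isomorphic), the prover wins every round, illustrating how interaction lets a weak verifier check a claim it could not evaluate alone.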
Key Result
The new NIP and MNIP protocols outperform prior scalable oversight protocols, including debate and Merlin-Arthur classifiers, when evaluated with untrained models, though the differences diminish after training via expert iteration.
Source
- Link: https://neural-interactive-proofs.com/
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- black-box-make-ai-solve-it — Black-box safety (understand and control current model behaviour) / Iterative alignment