Dodging systematic human errors in scalable oversight

Geoffrey Irving — 2025-05-14 — UK AISI — LessWrong

Summary

Proposes formal modifications to debate protocols to handle systematic human errors. These include a ‘Bob rejection’ mechanism, in which the adversarial debater can manipulate an ε-fraction of oracle responses, and cross-examination protocols that make this computationally tractable.

Source