Dodging systematic human errors in scalable oversight
Geoffrey Irving — 2025-05-14 — UK AISI — LessWrong
Summary
Proposes formal modifications to debate protocols to handle systematic human errors, including a ‘Bob rejection’ mechanism, in which the adversarial debater is allowed to manipulate an ε-fraction of oracle responses, and cross-examination protocols that make this computationally tractable.
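A toy sketch of the ε-fraction idea described above: an adversary wraps a human-judgment oracle and may flip at most an ε-fraction of its answers. This is an illustrative model only; the class and function names are invented here and do not come from the post's formal protocol.

```python
import random

def human_oracle(query: str) -> bool:
    """Toy stand-in for a human judge answering a yes/no claim
    (arbitrary deterministic answers, not real human judgments)."""
    return sum(query.encode()) % 2 == 0

class EpsilonCorruptedOracle:
    """Hypothetical toy model: an adversary may flip at most an
    epsilon fraction of the oracle's responses."""
    def __init__(self, oracle, epsilon: float, n_queries: int, seed: int = 0):
        self.oracle = oracle
        self.budget = int(epsilon * n_queries)  # max answers the adversary may corrupt
        self.flipped = 0
        self.rng = random.Random(seed)

    def ask(self, query: str) -> bool:
        answer = self.oracle(query)
        # Adversary flips the answer (here, at random) while budget remains.
        if self.flipped < self.budget and self.rng.random() < 0.5:
            self.flipped += 1
            return not answer
        return answer

n = 1000
oracle = EpsilonCorruptedOracle(human_oracle, epsilon=0.05, n_queries=n)
answers = [oracle.ask(f"claim-{i}") for i in range(n)]
honest = [human_oracle(f"claim-{i}") for i in range(n)]
corrupted_fraction = sum(a != h for a, h in zip(answers, honest)) / n
# By construction the adversary never exceeds its epsilon budget.
assert corrupted_fraction <= 0.05
```

The point of the toy is only that a protocol robust to such an oracle must succeed despite a bounded fraction of systematically wrong answers, which is the regime the post's modified debate protocols target.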
Source
- Link: https://lesswrong.com/posts/EgRJtwQurNzz8CEfJ/dodging-systematic-human-errors-in-scalable-oversight
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- asymptotic-guarantees — Theory