Scalable Oversight for Superhuman AI via Recursive Self-Critiquing
Xueru Wen, Jie Lou, Xinyu Lu, Junjie Yang, Yanjiang Liu, Yaojie Lu, … (+2 more) — 2025-02-07 — arXiv
Summary
Investigates recursive self-critiquing as a method for scalable oversight of superhuman AI. Human-AI and AI-AI experiments test two hypotheses: that critiquing a critique is easier than critiquing a response directly, and that this relationship holds recursively.
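The recursive structure behind these hypotheses can be sketched as a chain in which each level critiques the output of the level below it. This is only an illustrative sketch, not the authors' implementation: `critique_chain`, the `Critic` signature, and `toy_critic` are hypothetical names, and the toy critic is a placeholder for a human or model reviewer.

```python
from typing import Callable, List

# Hypothetical signature (an assumption, not from the paper): a critic takes
# the question and the text under review, and returns a critique of that text.
Critic = Callable[[str, str], str]


def critique_chain(question: str, response: str, critic: Critic, depth: int) -> List[str]:
    """Return [response, critique, critique-of-critique, ...] with `depth` critique levels."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    chain = [response]
    for _ in range(depth):
        # Each new level reviews the most recent item in the chain.
        chain.append(critic(question, chain[-1]))
    return chain


def toy_critic(question: str, target: str) -> str:
    # Placeholder reviewer used only to make the sketch runnable.
    return f"review of <{target}>"


chain = critique_chain("What is 2 + 2?", "5", toy_critic, depth=2)
for level, text in enumerate(chain):
    print(level, text)
```

Under the hypotheses being tested, an overseer would only need to judge the last item in the chain, on the assumption that each higher-order critique is easier to evaluate than the item it reviews.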
Key Result
The results suggest that recursive critique is a promising and tractable supervision pathway when direct evaluation of superhuman AI outputs is infeasible.
Source
- Link: https://arxiv.org/abs/2502.04675
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- supervising-ais-improving-ais — Make AI solve it