Great Models Think Alike and this Undermines AI Oversight
Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, … (+3 more) — 2025-02-06 — arXiv
Summary
Proposes CAPA (Chance Adjusted Probabilistic Agreement), a metric that measures similarity between language models by their overlap in mistakes, and uses it to study how model similarity affects two AI-oversight settings: LLM-as-a-judge evaluation and weak-to-strong generalization.
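To make the idea concrete, here is a minimal sketch of a kappa-style chance-adjusted agreement score over two models' per-example probability distributions. This is an illustrative stand-in in the spirit of Cohen's kappa extended to probabilistic outputs, not the paper's exact CAPA definition (which additionally accounts for model accuracy); the function name and array shapes are assumptions.

```python
import numpy as np

def chance_adjusted_agreement(p_a, p_b):
    """Illustrative kappa-style agreement between two models.

    p_a, p_b: arrays of shape (n_examples, n_options) holding each
    model's probability distribution over answer options per example.
    Returns 1.0 for identical predictions, 0.0 for chance-level
    agreement, and negative values for below-chance agreement.
    """
    p_a, p_b = np.asarray(p_a, float), np.asarray(p_b, float)
    # Observed agreement: probability both models pick the same
    # option on an example, averaged over the dataset.
    observed = np.mean(np.sum(p_a * p_b, axis=1))
    # Chance agreement: what the models' marginal option frequencies
    # alone would predict, ignoring per-example behavior.
    marg_a, marg_b = p_a.mean(axis=0), p_b.mean(axis=0)
    expected = np.sum(marg_a * marg_b)
    # Standard kappa normalization: rescale so chance maps to 0.
    return (observed - expected) / (1.0 - expected)
```

Two models that always answer identically score 1.0; two that always disagree score below 0, regardless of how often each option is chosen overall.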
Key Result
Model mistakes become more similar with increasing capabilities, creating correlated failure risks for AI oversight, and LLM-as-a-judge scores exhibit systematic bias favoring models similar to the judge.
Source
- Link: https://arxiv.org/abs/2502.04313
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
  - weak-to-strong-generalization — Make AI solve it