Great Models Think Alike and this Undermines AI Oversight

Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, … (+3 more) — 2025-02-06 — arXiv

Summary

Proposes CAPA (Chance Adjusted Probabilistic Agreement), a metric for measuring language model similarity based on overlap in mistakes, and uses it to study how model similarity affects AI oversight through LLM-as-a-judge evaluations and weak-to-strong generalization.
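The paper's exact CAPA formulation is not reproduced here, but the underlying idea of chance-adjusted agreement over probabilistic outputs can be illustrated with a minimal sketch in the spirit of Cohen's kappa: observed agreement is the probability that independent samples from two models' answer distributions coincide, and expected agreement is computed from each model's marginal option frequencies. The function name and the example distributions below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def chance_adjusted_agreement(p1, p2):
    """Illustrative chance-adjusted probabilistic agreement (kappa-style).

    p1, p2: arrays of shape (n_questions, n_options); each row is one
    model's probability distribution over answer options for a question.
    NOTE: this is a sketch, not the paper's exact CAPA definition.
    """
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    # Observed agreement: mean probability that independent samples from
    # the two models pick the same option on the same question.
    c_obs = np.mean(np.sum(p1 * p2, axis=1))
    # Expected (chance) agreement: same quantity under the models'
    # dataset-level marginal option frequencies.
    c_exp = np.sum(p1.mean(axis=0) * p2.mean(axis=0))
    return (c_obs - c_exp) / (1.0 - c_exp)

# Two models that concentrate probability mass on the same options
# agree well above chance; the score is ~0 for unrelated models.
a = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]]
b = [[0.8, 0.2, 0.0], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]]
print(chance_adjusted_agreement(a, b))
```

Because the metric corrects for chance, two models that agree only as often as their marginal answer frequencies predict score near zero, which is what makes it suitable for measuring genuine mistake overlap rather than raw accuracy-driven agreement.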

Key Result

Model mistakes become more similar as capabilities increase, creating correlated failure risks for AI oversight, and LLM-as-a-judge scores exhibit a systematic bias favoring models similar to the judge.
