A Pragmatic Way to Measure Chain-of-Thought Monitorability
Scott Emmons, Roland S. Zimmermann, David K. Elson, Rohin Shah — 2025-10-28 — Google DeepMind — arXiv
Summary
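Proposes metrics for two components of chain-of-thought (CoT) monitorability: legibility (whether a human can follow the reasoning) and coverage (whether the CoT contains all the reasoning needed to produce the final output). Both are scored by an LLM-based autorater. The paper sanity-checks the autorater with synthetic CoT degradations, then applies it to frontier models on challenging benchmarks.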
Key Result
Frontier models exhibit high default monitorability on challenging benchmarks when measured using the proposed legibility and coverage metrics.
Source
- Link: https://arxiv.org/abs/2510.23966
- Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
  - google-deepmind — Labs (giant companies)
  - chain-of-thought-monitoring — Black-box safety (understand and control current model behaviour) / Iterative alignment