Do LLMs know what they’re capable of? Why this matters for AI safety, and initial findings

Casey Barkan, Sid Black, Oliver Sourbut — 2025-07-13 — MATS Program — LessWrong / AI Alignment Forum

Summary

The post empirically measures LLMs' awareness of their own capabilities by testing whether they can predict, in advance and with calibrated confidence, their success on coding tasks. It then develops mathematical threat models showing how such self-knowledge bears on resource acquisition, sandbagging of evaluations, and escape from AI control.

Key Result

Current LLMs are poor at predicting their own success: they are overconfident and have low discriminatory power (their stated confidence only weakly separates tasks they go on to solve from tasks they fail), and more capable models show no trend toward better self-knowledge of their capabilities.
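The metrics behind this result (calibration, overconfidence, discrimination) can be illustrated with a small sketch. The data below is entirely made up, and the specific metric choices (Brier score, mean-confidence gap, AUROC) are common measures for this kind of analysis, not necessarily the exact ones used in the post:

```python
# Hypothetical sketch: scoring a model's in-advance self-predictions.
# `preds` are the model's stated probabilities of solving each coding
# task; `solved` records whether it actually solved them.
preds = [0.9, 0.8, 0.95, 0.7, 0.85, 0.6, 0.9, 0.75]
solved = [1, 0, 1, 0, 1, 0, 0, 1]

# Brier score: mean squared error of the forecasts (lower is better).
brier = sum((p - s) ** 2 for p, s in zip(preds, solved)) / len(preds)

# Overconfidence: mean stated probability minus actual success rate.
overconfidence = sum(preds) / len(preds) - sum(solved) / len(solved)

# Discrimination (AUROC): probability that a solved task received a
# higher stated confidence than an unsolved one (ties count half).
pos = [p for p, s in zip(preds, solved) if s == 1]
neg = [p for p, s in zip(preds, solved) if s == 0]
auroc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

print(f"Brier={brier:.3f}  overconfidence={overconfidence:+.3f}  AUROC={auroc:.2f}")
```

A well-calibrated model would have overconfidence near zero; high discriminatory power would show up as AUROC near 1.0 rather than the chance level of 0.5.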

Source