Do LLMs know what they’re capable of? Why this matters for AI safety, and initial findings
Casey Barkan, Sid Black, Oliver Sourbut — 2025-07-13 — MATS Program — LessWrong / AI Alignment Forum
Summary
Empirically measures LLMs' awareness of their own capabilities by testing whether they are calibrated, in advance, about their success on coding tasks, then develops mathematical threat models showing how capability self-awareness affects resource acquisition, sandbagging on evaluations, and escape from AI control.
Key Result
Current LLMs are poor at predicting their own success due to overconfidence and low discriminatory power, and more capable models show no trend of improved self-awareness of capability.
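The two failure modes named here can be made concrete with simple statistics over a model's in-advance predictions. A minimal sketch (not the authors' code; the toy data below is hypothetical) that separates miscalibration in the form of overconfidence from low discriminatory power:

```python
# Hypothetical sketch: quantifying overconfidence and discriminatory power
# from a model's in-advance success predictions on a set of tasks.
# `pred` = model's stated probability of solving each task (toy values),
# `actual` = whether it actually solved the task (toy values).

pred   = [0.9, 0.8, 0.95, 0.7, 0.85, 0.9, 0.6, 0.8]
actual = [1,   0,   1,    0,   1,    0,   1,   0]

n = len(pred)

# Brier score: mean squared error of the probabilistic predictions
# (lower is better; 0.25 is what a constant 0.5 prediction scores).
brier = sum((p - a) ** 2 for p, a in zip(pred, actual)) / n

# Overconfidence: mean predicted success rate minus actual success rate.
overconfidence = sum(pred) / n - sum(actual) / n

# Discriminatory power: gap between the mean prediction on solved tasks
# and on unsolved tasks. Near zero means the model cannot tell in advance
# which tasks it will solve, even if its average confidence is calibrated.
succ = [p for p, a in zip(pred, actual) if a == 1]
fail = [p for p, a in zip(pred, actual) if a == 0]
discrimination = sum(succ) / len(succ) - sum(fail) / len(fail)

print(f"Brier={brier:.3f} overconfidence={overconfidence:+.3f} "
      f"discrimination={discrimination:+.3f}")
```

In this toy data the model predicts ~81% success but succeeds only 50% of the time (overconfident), and its predictions on solved versus unsolved tasks barely differ (low discrimination), mirroring the pattern the post reports.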
Source
- Link: https://lesswrong.com/posts/9tHEibBBhQCHEyFsa/do-llms-know-what-they-re-capable-of-why-this-matters-for-ai
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda: