Do LLMs know what they’re capable of? Why this matters for AI safety, and initial findings
Casey Barkan, Sid Black, Oliver Sourbut — 2025-07-13 — MATS Program — LessWrong / AI Alignment Forum
Summary
Empirically measures LLMs' awareness of their own capabilities by testing whether they are calibrated, in advance, about their success on coding tasks, then develops mathematical threat models showing how capability self-awareness affects resource acquisition, sandbagging on evaluations, and escape from AI control.
Key Result
Current LLMs are poor at predicting their own success due to overconfidence and low discriminatory power, and more capable models show no trend of improved self-awareness of capability.
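The two failure modes named here can be made concrete with simple statistics over a model's in-advance predictions. A minimal sketch (not the authors' code; the toy data below is hypothetical) that separates miscalibration in the form of overconfidence from low discriminatory power:

```python
# Hypothetical sketch: quantifying overconfidence and discriminatory power
# from a model's in-advance success predictions on a set of tasks.
# `pred` = model's stated probability of solving each task (toy values),
# `actual` = whether it actually solved the task (toy values).

pred   = [0.9, 0.8, 0.95, 0.7, 0.85, 0.9, 0.6, 0.8]
actual = [1,   0,   1,    0,   1,    0,   1,   0]

n = len(pred)

# Brier score: mean squared error of the probabilistic predictions
# (lower is better; 0.25 is what a constant 0.5 prediction scores).
brier = sum((p - a) ** 2 for p, a in zip(pred, actual)) / n

# Overconfidence: mean predicted success rate minus actual success rate.
overconfidence = sum(pred) / n - sum(actual) / n

# Discriminatory power: gap between the mean prediction on solved tasks
# and on unsolved tasks. Near zero means the model cannot tell in advance
# which tasks it will solve, even if its average confidence is calibrated.
succ = [p for p, a in zip(pred, actual) if a == 1]
fail = [p for p, a in zip(pred, actual) if a == 0]
discrimination = sum(succ) / len(succ) - sum(fail) / len(fail)

print(f"Brier={brier:.3f} overconfidence={overconfidence:+.3f} "
      f"discrimination={discrimination:+.3f}")
```

In this toy data the model predicts ~81% success but succeeds only 50% of the time (overconfident), and its predictions on solved versus unsolved tasks barely differ (low discrimination), mirroring the pattern the post reports.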
Source
- Link: https://lesswrong.com/posts/9tHEibBBhQCHEyFsa/do-llms-know-what-they-re-capable-of-why-this-matters-for-ai
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda: