Evaluations
Measuring whether AI systems have crossed risk-relevant capability thresholds — dangerous-capabilities, deceptive-alignment propensity, autonomy, autonomous-replication, situational awareness — before deployment. Capability evaluations are the trigger mechanism in every modern frontier-safety framework: Anthropic’s RSP, OpenAI’s Preparedness Framework, and DeepMind’s FSF all gate decisions on eval results. See capability-evaluations for the methodology, capability-evals for the SR2025 research agenda, and model-organisms-of-misalignment for the testbed methodology that validates the evals.
The pillar’s central methodological problem: behavioral evaluation has a known ceiling against strategically deceptive systems, because a model that recognizes it is being tested can sandbag or selectively comply; see chain-of-thought-monitoring and the interpretability pillar for the proposed escape hatches.
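The gating pattern described above can be sketched as a simple threshold check. This is a hypothetical illustration only: the category names, threshold values, and function names are invented for exposition and do not come from the RSP, Preparedness Framework, or FSF.

```python
from dataclasses import dataclass

# Hypothetical capability categories and risk thresholds; real frameworks
# define their own categories, levels, and escalation procedures.
THRESHOLDS = {
    "dangerous_capabilities": 0.2,
    "autonomous_replication": 0.1,
    "situational_awareness": 0.5,
}

@dataclass
class EvalResult:
    category: str
    score: float  # e.g., success rate on that category's eval suite

def gate_deployment(results: list[EvalResult]) -> tuple[bool, list[str]]:
    """Return (deploy_ok, tripped_categories): deployment is blocked
    when any category's score reaches its risk threshold."""
    tripped = [
        r.category
        for r in results
        if r.score >= THRESHOLDS.get(r.category, float("inf"))
    ]
    return (len(tripped) == 0, tripped)
```

Note that this sketch inherits the ceiling problem directly: a model that sandbags the evals produces low scores and passes the gate, so the decision is only as trustworthy as the behavioral scores feeding it.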