Evaluations
Measuring whether AI systems have crossed risk-relevant capability thresholds — dangerous-capabilities, deceptive-alignment propensity, autonomy, autonomous-replication, situational awareness — before deployment. Capability evaluations are the trigger mechanism in every modern frontier-safety framework: Anthropic’s RSP, OpenAI’s Preparedness Framework, and DeepMind’s FSF all gate decisions on eval results. See capability-evaluations for the methodology, capability-evals for the SR2025 research agenda, and model-organisms-of-misalignment for the testbed methodology that validates the evals.
The pillar’s central methodological problem: behavioral evaluation has a known ceiling against strategically deceptive systems, because a model that recognizes it is being tested can sandbag or selectively comply; see chain-of-thought-monitoring and the interpretability pillar for the proposed escape hatches.
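The gating pattern described above can be sketched as a simple threshold check. This is a hypothetical illustration only: the category names, threshold values, and function names are invented for exposition and do not come from the RSP, Preparedness Framework, or FSF.

```python
from dataclasses import dataclass

# Hypothetical capability categories and risk thresholds; real frameworks
# define their own categories, levels, and escalation procedures.
THRESHOLDS = {
    "dangerous_capabilities": 0.2,
    "autonomous_replication": 0.1,
    "situational_awareness": 0.5,
}

@dataclass
class EvalResult:
    category: str
    score: float  # e.g., success rate on that category's eval suite

def gate_deployment(results: list[EvalResult]) -> tuple[bool, list[str]]:
    """Return (deploy_ok, tripped_categories): deployment is blocked
    when any category's score reaches its risk threshold."""
    tripped = [
        r.category
        for r in results
        if r.score >= THRESHOLDS.get(r.category, float("inf"))
    ]
    return (len(tripped) == 0, tripped)
```

Note that this sketch inherits the ceiling problem directly: a model that sandbags the evals produces low scores and passes the gate, so the decision is only as trustworthy as the behavioral scores feeding it.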