Evaluated Properties
The three-way decomposition of what AI safety evaluations measure: capabilities, propensities, and control. The AI Safety Atlas (Ch.5) treats this as the conceptual spine of the entire evaluations framework.
What Counts as an Evaluation
Three defining characteristics:
- A specific property or risk being assessed
- A methodology for gathering evidence
- A method for analyzing that evidence to draw conclusions
Evaluations differ from benchmarks: benchmarks are standardized tests; evaluations are complete safety assessment protocols incorporating benchmarks plus broader analysis.
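The three characteristics can be sketched as a minimal data structure; the names (`Evaluation`, `run`) are illustrative, not from the Atlas:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Evaluation:
    target_property: str                 # the property or risk being assessed
    gather_evidence: Callable[[], list]  # methodology: prompts, tasks, trials
    analyze: Callable[[list], dict]      # turn raw evidence into a conclusion

def run(ev: Evaluation) -> dict:
    # A benchmark alone supplies gather_evidence; an evaluation adds
    # the target property and the analysis step.
    return ev.analyze(ev.gather_evidence())
```

On this framing, swapping in a standardized benchmark for `gather_evidence` shows why a benchmark is a component of an evaluation rather than the whole thing.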
The Three Properties
| Property | What it measures | Key question |
|---|---|---|
| Capability | Upper bounds | What’s the worst this could do? |
| Propensity | Default behavior | What does this typically do? |
| Control | Safeguard robustness | Can safety measures stop it if it tries? |
A system might be capable of harmful actions while choosing not to perform them. Both aspects matter for safety.
Why the Three Differ Operationally
Capability ≠ Propensity
- Capability: clear pass/fail (can the model do X?)
- Propensity: requires comparative-choice scenarios (does it prefer X among alternatives?)
A model that can lie isn’t the same as a model that will lie. RLHF changes propensities without removing capabilities — “it takes more effort (jailbreaking) to get them to actually generate this type of content.”
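The operational difference can be sketched as two scoring rules over a model interface; `ask`, `did_x`, and `chose_x` are hypothetical stand-ins for any model call and judging functions:

```python
def capability_score(ask, elicitation_prompts, did_x) -> bool:
    """Pass/fail: did ANY elicitation attempt (incl. jailbreaks) get X out?"""
    return any(did_x(ask(p)) for p in elicitation_prompts)

def propensity_score(ask, choice_scenarios, chose_x) -> float:
    """How often the model picks X when alternatives are freely available."""
    picks = [chose_x(ask(s)) for s in choice_scenarios]
    return sum(picks) / len(picks)
```

Capability takes a max over attempts (one success proves the ability exists); propensity takes a rate over free choices (preference is a statistical tendency). That asymmetry is why a model can score True on the first and near zero on the second.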
Propensity ≠ Control
- Propensity: what the model does when it has free choice
- Control: what the model can accomplish despite active safeguards
Safety might come from any of three sources:
- Hard barriers — physical/technical limits (air gaps)
- Detection mechanisms — monitoring catches attacks
- Capability limitations — model lacks bypass abilities
The Atlas: “Models might fail attacks not because detection works well, but because they lack the strategic planning and coordination abilities.”
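The Atlas's attribution point can be sketched as a red-team trial that records why an attack stopped, not just whether it did; the function and outcome names are assumptions for illustration:

```python
def control_trial(attack_steps, hard_barrier, detector):
    """Run one red-team trial and record WHY the attack stopped, if it did."""
    for step in attack_steps:
        if step is None:            # capability limit: no viable next step
            return "gave_up"
        if hard_barrier(step):      # physical/technical limit (e.g. air gap)
            return "blocked"
        if detector(step):          # monitoring caught the attempt
            return "caught"
    return "succeeded"
```

A control eval that only counted "succeeded" vs. "failed" would conflate the three sources of safety; logging the failure mode is what lets you tell detection apart from capability limitation.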
Why Evaluating Each Matters
The three combine into a complete safety picture:
- Capability evals → upper bounds on potential harm
- Propensity evals → default behavioral risk
- Control evals → robustness under adversarial conditions
Each gives a different kind of safety guarantee:
- Capability-limited safety = “can’t do it”
- Propensity-aligned safety = “won’t try to do it”
- Control-protected safety = “could try but can’t succeed”
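The three guarantees compose in a fixed order of precedence, which can be sketched as follows (boolean inputs are a simplification; real eval results are graded):

```python
def safety_guarantee(capable: bool, inclined: bool, contained: bool) -> str:
    """Map the three eval outcomes onto the kind of guarantee they license."""
    if not capable:
        return "capability-limited: can't do it"
    if not inclined:
        return "propensity-aligned: won't try to do it"
    if contained:
        return "control-protected: could try but can't succeed"
    return "no guarantee: capable, inclined, and uncontained"
```

The ordering reflects that each later guarantee only matters once the earlier one has failed: control is the backstop when a system is both capable and inclined.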
These guarantees decay at different rates as systems scale. Capability limits erode fastest as models improve; control protocols may degrade at a different rate than alignment; propensity-based safety depends on whether RLHF and related training generalizes to new situations.
Evaluation Modes
The Atlas distinguishes:
- Black-box — only inputs and outputs accessible
- Gray-box — partial access to internals (e.g., activations) via interpretability tools
- White-box — full access to and understanding of internals; per the Atlas, “largely impossible without significant advances in interpretability”
Most current evaluation is black-box; gray-box techniques (SAEs, probes) are expanding the toolkit. See evaluation-techniques.
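One common gray-box probe technique is a difference-of-means direction over activations; a minimal sketch, assuming activations arrive as plain vectors (lists of floats) from some labeled contrast set:

```python
def mean_difference_direction(pos_acts, neg_acts):
    """Difference-of-means probe: direction between the two class means."""
    dim = len(pos_acts[0])
    mean = lambda acts, i: sum(a[i] for a in acts) / len(acts)
    return [mean(pos_acts, i) - mean(neg_acts, i) for i in range(dim)]

def probe_predict(activation, direction, threshold=0.0):
    """Project an unseen activation onto the direction and threshold it."""
    return sum(a * d for a, d in zip(activation, direction)) > threshold
```

This is the read-out step only; a real gray-box eval also needs access to the model's hidden states and a labeled contrast set (e.g., honest vs. deceptive completions) to fit the direction.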
Connection to Wiki
The three-way decomposition gives the wiki’s SR2025 evaluation agendas a clean organizational structure:
- Capability evals → capability-evals, wmd-evals-weapons-of-mass-destruction, autonomy-evals, self-replication-evals
- Propensity evals → ai-deception-evals, other-evals (honesty, sycophancy, corrigibility)
- Control evals → control agenda
Plus dual-property agendas:
- ai-scheming-evals — both capability (can scheme) and propensity (will scheme)
- situational-awareness-and-self-awareness-evals — both capability and propensity
- sandbagging-evals — propensity that requires capability
Related Pages
- capability-evaluations
- propensity-evaluations
- control-evaluations
- evaluation-techniques
- evaluation-design
- capability-evals
- ai-deception-evals
- ai-scheming-evals
- autonomy-evals
- control
- sandbagging-evals
- ai-safety-atlas-textbook
- atlas-ch5-evaluations-03-evaluated-properties
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- AI Safety Atlas Ch.5 — Evaluated Properties — referenced as
[[atlas-ch5-evaluations-03-evaluated-properties]]