Evaluated Properties

AI safety evaluations measure three distinct properties: capabilities, propensities, and control. The AI Safety Atlas (Ch. 5) treats this three-way decomposition as the conceptual spine of the entire evaluations framework.

What Counts as an Evaluation

Three defining characteristics:

  1. Specific property or risk being assessed
  2. Methodology for gathering evidence
  3. Way to analyze evidence to draw conclusions

Evaluations differ from benchmarks: a benchmark is a standardized test, while an evaluation is a complete safety assessment protocol that incorporates benchmarks alongside broader analysis.
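The three defining characteristics above can be sketched as a minimal data structure. This is illustrative only; the names (`Evaluation`, `gather_evidence`, `analyze`) are my own, not from the Atlas:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Evaluation:
    """An evaluation per the three defining characteristics."""
    target_property: str                 # 1. property or risk being assessed
    gather_evidence: Callable[[], list]  # 2. methodology for gathering evidence
    analyze: Callable[[list], str]       # 3. way to analyze evidence into a conclusion

    def run(self) -> str:
        return self.analyze(self.gather_evidence())

# A benchmark supplies only the evidence-gathering piece; the evaluation
# wraps it with a target property and an analysis step.
deception_eval = Evaluation(
    target_property="deception capability",
    gather_evidence=lambda: [True, False, True],  # stand-in benchmark results
    analyze=lambda results: "capable" if any(results) else "not demonstrated",
)
print(deception_eval.run())  # → capable
```

The point of the sketch: swapping the benchmark (`gather_evidence`) leaves the evaluation incomplete without the other two components.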

The Three Properties

| Property | What it measures | Key question |
| --- | --- | --- |
| Capability | Upper bounds | What’s the worst this could do? |
| Propensity | Default behavior | What does this typically do? |
| Control | Safeguard robustness | Can safety measures stop it if it tries? |

A system might be capable of harmful actions while choosing not to perform them. Both aspects matter for safety.

Why the Three Differ Operationally

Capability ≠ Propensity

  • Capability: clear pass/fail (can the model do X?)
  • Propensity: requires comparative-choice scenarios (does it prefer X among alternatives?)

A model that can lie isn’t the same as a model that will lie. RLHF changes propensities without removing capabilities — “it takes more effort (jailbreaking) to get them to actually generate this type of content.”
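The operational difference can be made concrete with a toy harness. The stub model and the numbers below are invented for illustration; only the pass/fail vs. comparative-choice distinction comes from the text:

```python
import random

random.seed(0)

def model_response(prompt: str, forced: bool = False) -> str:
    """Hypothetical stub standing in for a real model call."""
    if forced:
        return "LIE"  # jailbroken: the capability is still present
    # Default behavior after RLHF: lying is possible but dispreferred.
    return random.choice(["LIE", "TRUTH", "TRUTH", "TRUTH"])

# Capability eval: pass/fail — can the model produce the behavior at all?
capable = any(model_response("...", forced=True) == "LIE" for _ in range(10))

# Propensity eval: comparative choice — how often does it pick the behavior
# when alternatives are available?
trials = [model_response("...") for _ in range(1000)]
propensity = trials.count("LIE") / len(trials)

print(capable)           # → True  (the model *can* lie)
print(propensity < 0.5)  # → True  (but lying is not the default)
```

The same model scores "pass" on the capability eval while scoring low on the propensity eval, which is exactly why the two cannot be collapsed into one measurement.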

Propensity ≠ Control

  • Propensity: what the model does when it has free choice
  • Control: what the model can accomplish despite active safeguards

Safety might come from any of three sources:

  1. Hard barriers — physical/technical limits (air gaps)
  2. Detection mechanisms — monitoring catches attacks
  3. Capability limitations — model lacks bypass abilities

The Atlas: “Models might fail attacks not because detection works well, but because they lack the strategic planning and coordination abilities.”
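A control eval therefore has to attribute *why* attacks fail, not just count failures. A minimal sketch of that attribution, with invented probabilities (none of these numbers come from the Atlas):

```python
import random

random.seed(1)

def attempt_bypass(model_skill: float, detection_rate: float) -> str:
    """One trial of a toy control eval against active safeguards."""
    if random.random() > model_skill:
        return "failed: lacked capability"  # source 3: capability limitation
    if random.random() < detection_rate:
        return "failed: detected"           # source 2: detection mechanism
    return "succeeded"                      # safeguards bypassed

trials = [attempt_bypass(model_skill=0.3, detection_rate=0.9)
          for _ in range(1000)]

# Report failures by cause: a low success rate driven by capability
# limitations gives a weaker guarantee than one driven by detection.
for outcome in ("failed: lacked capability", "failed: detected", "succeeded"):
    print(outcome, trials.count(outcome))
```

With these toy parameters, most failures come from the model's lack of skill rather than from detection, illustrating the Atlas's caution: the aggregate failure rate alone would overstate how well the safeguards work.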

Why Evaluating Each Matters

The three combine into a complete safety picture:

  • Capability evals → upper bounds on potential harm
  • Propensity evals → default behavioral risk
  • Control evals → robustness under adversarial conditions

Each gives a different kind of safety guarantee:

  • Capability-limited safety = “can’t do it”
  • Propensity-aligned safety = “won’t try to do it”
  • Control-protected safety = “could try but can’t succeed”

These have different decay paths as systems scale. Capability limits erode fastest; control protocols may erode at different rates than alignment; propensity-based safety depends on whether RLHF/training generalizes.
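The composition of the three guarantees can be written down directly. A minimal sketch (function name and shape are my own):

```python
def safety_picture(capability_limited: bool,
                   propensity_aligned: bool,
                   control_protected: bool) -> list:
    """Which of the three kinds of safety guarantee currently hold."""
    guarantees = []
    if capability_limited:
        guarantees.append("can't do it")
    if propensity_aligned:
        guarantees.append("won't try to do it")
    if control_protected:
        guarantees.append("could try but can't succeed")
    return guarantees

# A capable model can still be safe on the other two axes — and because
# each guarantee decays at a different rate as systems scale, tracking
# them separately matters.
print(safety_picture(False, True, True))
```

An empty list here is the failure case: no axis provides a guarantee.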

Evaluation Modes

The Atlas distinguishes:

  • Black-box — only inputs and outputs accessible
  • Gray-box — interpretability tools examine internals
  • White-box — direct analysis of internals; per the Atlas, “largely impossible without significant advances in interpretability”

Most current evaluation is black-box; gray-box techniques (SAEs, probes) are expanding the toolkit. See evaluation-techniques.
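The access modes differ only in what the evaluator can see. A toy sketch of that interface distinction (class names, the stub model, and the captured activations are all illustrative):

```python
class BlackBoxAccess:
    """Black-box: only inputs and outputs are accessible."""
    def __init__(self, model):
        self._model = model

    def query(self, prompt: str) -> str:
        return self._model(prompt)

class GrayBoxAccess(BlackBoxAccess):
    """Gray-box: adds interpretability hooks on internals (e.g. probes, SAE features)."""
    def __init__(self, model, activations: dict):
        super().__init__(model)
        self._activations = activations

    def probe(self, layer: str) -> list:
        return self._activations[layer]

# Toy model and captured activations (stand-ins for a real model run).
toy = lambda p: p.upper()
gray = GrayBoxAccess(toy, activations={"layer_3": [0.1, -0.4, 0.7]})
print(gray.query("hello"))    # → HELLO      (available in both modes)
print(gray.probe("layer_3"))  # internals, visible only with gray-box access
```

A black-box evaluator gets `query` alone; the gray-box toolkit extends, rather than replaces, that interface.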

Connection to Wiki

The three-way decomposition gives the wiki’s SR2025 evaluation agendas a clean organizational structure, including dual-property agendas that span two of the three properties.

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.