Evaluated Properties

AI safety evaluations measure three distinct properties: capabilities, propensities, and control. The AI Safety Atlas (Ch. 5) treats this three-way decomposition as the conceptual spine of the entire evaluations framework.

What Counts as an Evaluation

Three defining characteristics:

  1. Specific property or risk being assessed
  2. Methodology for gathering evidence
  3. Way to analyze evidence to draw conclusions

Evaluations differ from benchmarks: a benchmark is a standardized test, while an evaluation is a complete safety assessment protocol that incorporates benchmarks alongside broader analysis.
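The three defining characteristics above can be sketched as a minimal data structure. This is illustrative only; the names (`Evaluation`, `gather_evidence`, `analyze`) are my own, not from the Atlas:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Evaluation:
    """An evaluation per the three defining characteristics."""
    target_property: str                 # 1. property or risk being assessed
    gather_evidence: Callable[[], list]  # 2. methodology for gathering evidence
    analyze: Callable[[list], str]       # 3. way to analyze evidence into a conclusion

    def run(self) -> str:
        return self.analyze(self.gather_evidence())

# A benchmark supplies only the evidence-gathering piece; the evaluation
# wraps it with a target property and an analysis step.
deception_eval = Evaluation(
    target_property="deception capability",
    gather_evidence=lambda: [True, False, True],  # stand-in benchmark results
    analyze=lambda results: "capable" if any(results) else "not demonstrated",
)
print(deception_eval.run())  # → capable
```

The point of the sketch: swapping the benchmark (`gather_evidence`) leaves the evaluation incomplete without the other two components.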

The Three Properties

| Property | What it measures | Key question |
| --- | --- | --- |
| Capability | Upper bounds | What’s the worst this could do? |
| Propensity | Default behavior | What does this typically do? |
| Control | Safeguard robustness | Can safety measures stop it if it tries? |

A system might be capable of harmful actions while choosing not to perform them. Both aspects matter for safety.

Why the Three Differ Operationally

Capability ≠ Propensity

  • Capability: clear pass/fail (can the model do X?)
  • Propensity: requires comparative-choice scenarios (does it prefer X among alternatives?)

A model that can lie isn’t the same as a model that will lie. RLHF changes propensities without removing capabilities — “it takes more effort (jailbreaking) to get them to actually generate this type of content.”
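The operational difference can be made concrete with a toy harness. The stub model and the numbers below are invented for illustration; only the pass/fail vs. comparative-choice distinction comes from the text:

```python
import random

random.seed(0)

def model_response(prompt: str, forced: bool = False) -> str:
    """Hypothetical stub standing in for a real model call."""
    if forced:
        return "LIE"  # jailbroken: the capability is still present
    # Default behavior after RLHF: lying is possible but dispreferred.
    return random.choice(["LIE", "TRUTH", "TRUTH", "TRUTH"])

# Capability eval: pass/fail — can the model produce the behavior at all?
capable = any(model_response("...", forced=True) == "LIE" for _ in range(10))

# Propensity eval: comparative choice — how often does it pick the behavior
# when alternatives are available?
trials = [model_response("...") for _ in range(1000)]
propensity = trials.count("LIE") / len(trials)

print(capable)           # → True  (the model *can* lie)
print(propensity < 0.5)  # → True  (but lying is not the default)
```

The same model scores "pass" on the capability eval while scoring low on the propensity eval, which is exactly why the two cannot be collapsed into one measurement.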

Propensity ≠ Control

  • Propensity: what the model does when it has free choice
  • Control: what the model can accomplish despite active safeguards

Safety might come from any of three sources:

  1. Hard barriers — physical/technical limits (air gaps)
  2. Detection mechanisms — monitoring catches attacks
  3. Capability limitations — model lacks bypass abilities

The Atlas: “Models might fail attacks not because detection works well, but because they lack the strategic planning and coordination abilities.”
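A control eval therefore has to attribute *why* attacks fail, not just count failures. A minimal sketch of that attribution, with invented probabilities (none of these numbers come from the Atlas):

```python
import random

random.seed(1)

def attempt_bypass(model_skill: float, detection_rate: float) -> str:
    """One trial of a toy control eval against active safeguards."""
    if random.random() > model_skill:
        return "failed: lacked capability"  # source 3: capability limitation
    if random.random() < detection_rate:
        return "failed: detected"           # source 2: detection mechanism
    return "succeeded"                      # safeguards bypassed

trials = [attempt_bypass(model_skill=0.3, detection_rate=0.9)
          for _ in range(1000)]

# Report failures by cause: a low success rate driven by capability
# limitations gives a weaker guarantee than one driven by detection.
for outcome in ("failed: lacked capability", "failed: detected", "succeeded"):
    print(outcome, trials.count(outcome))
```

With these toy parameters, most failures come from the model's lack of skill rather than from detection, illustrating the Atlas's caution: the aggregate failure rate alone would overstate how well the safeguards work.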

Why Evaluating Each Matters

The three combine into a complete safety picture:

  • Capability evals → upper bounds on potential harm
  • Propensity evals → default behavioral risk
  • Control evals → robustness under adversarial conditions

Each gives a different kind of safety guarantee:

  • Capability-limited safety = “can’t do it”
  • Propensity-aligned safety = “won’t try to do it”
  • Control-protected safety = “could try but can’t succeed”

These have different decay paths as systems scale. Capability limits erode fastest; control protocols may erode at different rates than alignment; propensity-based safety depends on whether RLHF/training generalizes.
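The composition of the three guarantees can be written down directly. A minimal sketch (function name and shape are my own):

```python
def safety_picture(capability_limited: bool,
                   propensity_aligned: bool,
                   control_protected: bool) -> list:
    """Which of the three kinds of safety guarantee currently hold."""
    guarantees = []
    if capability_limited:
        guarantees.append("can't do it")
    if propensity_aligned:
        guarantees.append("won't try to do it")
    if control_protected:
        guarantees.append("could try but can't succeed")
    return guarantees

# A capable model can still be safe on the other two axes — and because
# each guarantee decays at a different rate as systems scale, tracking
# them separately matters.
print(safety_picture(False, True, True))
```

An empty list here is the failure case: no axis provides a guarantee.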

Evaluation Modes

The Atlas distinguishes:

  • Black-box — only inputs and outputs accessible
  • Gray-box — interpretability tools examine internals
  • White-box — direct analysis of internals; per the Atlas, “largely impossible without significant advances in interpretability”

Most current evaluation is black-box; gray-box techniques (SAEs, probes) are expanding the toolkit. See evaluation-techniques.
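The access modes differ only in what the evaluator can see. A toy sketch of that interface distinction (class names, the stub model, and the captured activations are all illustrative):

```python
class BlackBoxAccess:
    """Black-box: only inputs and outputs are accessible."""
    def __init__(self, model):
        self._model = model

    def query(self, prompt: str) -> str:
        return self._model(prompt)

class GrayBoxAccess(BlackBoxAccess):
    """Gray-box: adds interpretability hooks on internals (e.g. probes, SAE features)."""
    def __init__(self, model, activations: dict):
        super().__init__(model)
        self._activations = activations

    def probe(self, layer: str) -> list:
        return self._activations[layer]

# Toy model and captured activations (stand-ins for a real model run).
toy = lambda p: p.upper()
gray = GrayBoxAccess(toy, activations={"layer_3": [0.1, -0.4, 0.7]})
print(gray.query("hello"))    # → HELLO      (available in both modes)
print(gray.probe("layer_3"))  # internals, visible only with gray-box access
```

A black-box evaluator gets `query` alone; the gray-box toolkit extends, rather than replaces, that interface.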

Connection to Wiki

The three-way decomposition gives the wiki’s SR2025 evaluation agendas a clean organizational structure, including dual-property agendas that span two of the three properties.

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.