AI Safety Atlas Ch.5 — Evaluated Properties

Source: Evaluated Properties

The chapter’s structural foundation: a three-way decomposition of what evaluations measure. See evaluated-properties.

What Counts as an Evaluation

Three defining characteristics:

  1. Specific property or risk being assessed
  2. Methodology for gathering evidence
  3. Way to analyze evidence to draw conclusions
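The three characteristics above can be sketched as a minimal data structure. This is a hypothetical illustration, not code from the Atlas; the names (`Evaluation`, `run`, the toy honesty check) are assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Evaluation:
    """The three defining characteristics of an evaluation."""
    property_assessed: str                     # 1. specific property or risk
    gather_evidence: Callable[[object], list]  # 2. methodology for gathering evidence
    analyze: Callable[[list], str]             # 3. way to analyze evidence into a conclusion

def run(ev: Evaluation, model: object) -> str:
    """Run an evaluation end to end: gather evidence, then analyze it."""
    evidence = ev.gather_evidence(model)
    return ev.analyze(evidence)

# Toy usage: a trivial "honesty" check against a stub model.
stub_model = lambda prompt: "yes"
honesty_eval = Evaluation(
    property_assessed="honesty",
    gather_evidence=lambda m: [m("Is 2+2=4?")],
    analyze=lambda answers: "pass" if all(a == "yes" for a in answers) else "fail",
)
print(run(honesty_eval, stub_model))  # -> pass
```

The point of separating the three fields is that any one can vary independently: the same evidence-gathering methodology can serve different analyses, and vice versa.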

The Capability-Propensity-Control Triad

The most fundamental distinction: what a model can do (capabilities) vs. what it tends to do (propensities). A third axis: whether safety measures hold even if it tries (control).

A system might be capable of harmful actions while choosing not to perform them — both aspects matter.

  Property    | What it measures     | Key question
  Capability  | Upper bounds         | What’s the worst this could do?
  Propensity  | Default behavior     | What does this typically do?
  Control     | Safeguard robustness | Can safety measures stop it if it tries?

Capability Evaluations

Measure what models can do, focusing on dangerous abilities. Nine key categories:

  1. Cyber-offense — vulnerability discovery, exploitation, subtle bug insertion
  2. Deception — believable false statements, maintained deceptions
  3. Persuasion and manipulation — shaping beliefs even toward untrue positions
  4. Political strategy — social modeling for political influence
  5. Weapons acquisition — accessing or building weapons systems
  6. Long-horizon planning — multi-step plans across long timeframes
  7. AI development — building new AI systems with dangerous capabilities
  8. Situational awareness — distinguishing training/eval/deployment contexts
  9. Self-proliferation — breaking out of local environments, evading monitoring

These become more dangerous in combination. Strong situational awareness + deception → behave-during-eval, defect-during-deploy.
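One hypothetical way to operationalize that combination risk is to flag jointly elevated capability scores rather than scoring each category in isolation. The threshold, score scale, and capability names below are illustrative assumptions, not part of any published methodology.

```python
def combination_flags(scores: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Flag dangerous capability *combinations* from per-category scores in [0, 1].

    Individually sub-critical scores can still be jointly concerning:
    situational awareness plus deception enables behaving well during
    evaluation while defecting during deployment.
    """
    flags = []
    if (scores.get("situational_awareness", 0.0) > threshold
            and scores.get("deception", 0.0) > threshold):
        flags.append("eval-aware deception: may behave during eval, defect during deploy")
    return flags

print(combination_flags({"situational_awareness": 0.9, "deception": 0.8}))
print(combination_flags({"deception": 0.9}))  # -> [] (no flagged combination alone)
```

A real harness would enumerate many such pairings; the sketch shows only the one combination named in the text.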

Propensity Evaluations

What behaviors models prioritize by default — sometimes called “alignment evaluations.” Eight key propensities:

  1. Toxicity — offensive/inappropriate content
  2. Bias/Discrimination — unfair outputs
  3. Honesty — reporting its actual beliefs, with accurate uncertainty
  4. Truthfulness — producing factually accurate outputs
  5. Sycophancy — telling users what they want to hear
  6. Deception (propensity) — intentionally misleading output
  7. Corrigibility — accepting feedback, correcting behavior
  8. Power Seeking — tendency to acquire control over its environment

Propensities and capabilities interconnect: concerning propensities may emerge only as models gain capability. “A basic form of power-seeking behavior might appear unconcerning in simple systems but become problematic as models gain better strategic understanding.”

Control Evaluations

Whether safety measures hold under active circumvention attempts. Reveals three distinct sources of safety:

  1. Hard barriers — physical/technical limits (air gaps)
  2. Detection mechanisms — monitoring catches attacks
  3. Capability limitations — models lack abilities to bypass

Critical for strategic planning: a model might fail its attacks because it lacks strategic-planning capability, not because monitoring works. The different safety sources decay on different timelines as capabilities grow.

Drives:

  • Risk timeline — capability-limited safety tells us when barriers might fail
  • Defense prioritization — detection vs. capability limits
  • Deployment governance — models unable to attack ≠ models that attempt and get caught
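A hedged sketch of how a control evaluation might attribute why an attack failed, distinguishing the three safety sources above. The decision logic and names (`SafetySource`, `attribute_safety`) are assumptions for illustration, not an established procedure.

```python
from enum import Enum

class SafetySource(Enum):
    CAPABILITY_LIMIT = "capability limitation"  # model lacked abilities to mount the attack
    DETECTION = "detection mechanism"           # monitoring caught the attempt
    HARD_BARRIER = "hard barrier"               # physical/technical limit (e.g. air gap) held

def attribute_safety(attempted: bool, detected: bool) -> SafetySource:
    """Attribute why a red-teamed attack failed.

    No competent attempt -> safety rests on capability limits, which
    decay as capabilities grow. Attempted and detected -> monitoring is
    doing the work. Attempted, undetected, but still blocked -> a hard
    barrier absorbed the attempt.
    """
    if not attempted:
        return SafetySource.CAPABILITY_LIMIT
    if detected:
        return SafetySource.DETECTION
    return SafetySource.HARD_BARRIER

print(attribute_safety(attempted=False, detected=False))  # SafetySource.CAPABILITY_LIMIT
```

The attribution matters for governance because only the first outcome is expected to disappear on its own as models improve.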

Connection to Wiki

This subchapter is the conceptual spine of Ch.5. The three-way decomposition gives the wiki a clean way to organize the SR2025 evaluation agendas: