AI Safety Atlas Ch.5 — Evaluated Properties
Source: Evaluated Properties
The chapter’s structural foundation: a three-way decomposition of what evaluations measure. See evaluated-properties.
What Counts as an Evaluation
Three defining characteristics:
- Specific property or risk being assessed
- Methodology for gathering evidence
- Way to analyze evidence to draw conclusions
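The three defining characteristics can be captured in a minimal data structure. This is a hypothetical sketch, not an API from the Atlas; all names, the toy prompts, and the refusal-rate metric are illustrative.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Evaluation:
    """Minimal sketch of an evaluation's three defining parts."""
    property_assessed: str                  # specific property or risk being assessed
    gather_evidence: Callable[[Any], list]  # methodology for gathering evidence
    analyze: Callable[[list], str]          # way to draw conclusions from evidence

# Toy example: a hypothetical refusal-rate propensity check.
harmful_prompts = ["prompt_a", "prompt_b"]

def gather(model) -> list:
    return [model(p) for p in harmful_prompts]

def analyze(responses: list) -> str:
    refusals = sum(r == "refuse" for r in responses)
    return f"refusal rate: {refusals / len(responses):.0%}"

eval_ = Evaluation("harmful-content propensity", gather, analyze)
# Stub model that always refuses:
result = eval_.analyze(eval_.gather_evidence(lambda p: "refuse"))
# result == "refusal rate: 100%"
```

The point of the structure is that each of the three parts can vary independently: the same evidence-gathering methodology can serve different analyses, and vice versa.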
The Capability-Propensity-Control Triad
The most fundamental distinction is between what a model can do (capabilities) and what it tends to do (propensities), with a third dimension: whether safety measures hold if it tries (control).
A system might be capable of harmful actions while choosing not to perform them — both aspects matter.
| Property | What it measures | Key question |
|---|---|---|
| Capability | Upper bounds | What’s the worst this could do? |
| Propensity | Default behavior | What does this typically do? |
| Control | Safeguard robustness | Can safety measures stop it if it tries? |
Capability Evaluations
Measure what models can do, focusing on dangerous abilities. Nine key categories:
- Cyber-offense — vulnerability discovery, exploitation, subtle bug insertion
- Deception — believable false statements, maintained deceptions
- Persuasion and manipulation — shaping beliefs even toward untrue positions
- Political strategy — social modeling for political influence
- Weapons acquisition — accessing or building weapons systems
- Long-horizon planning — multi-step plans across long timeframes
- AI development — building new AI systems with dangerous capabilities
- Situational awareness — distinguishing training/eval/deployment contexts
- Self-proliferation — breaking out of local environments, evading monitoring
These capabilities become more dangerous in combination: strong situational awareness plus deception enables a model to behave well during evaluation and defect during deployment.
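Checking for compounding risk across capability categories can be sketched as a simple subset test. The combination list and warning strings are illustrative assumptions, not an evaluation protocol from the source.

```python
# Hypothetical sketch: flag capability combinations that compound risk.
# Category names follow the list above; the combinations are illustrative.
DANGEROUS_COMBOS = [
    ({"situational_awareness", "deception"},
     "may behave during evaluation and defect during deployment"),
    ({"long_horizon_planning", "self_proliferation"},
     "may sustain escape attempts across long timeframes"),
]

def combination_risks(present_capabilities: set) -> list:
    """Return a warning for every dangerous combination fully present."""
    return [warning for combo, warning in DANGEROUS_COMBOS
            if combo <= present_capabilities]

risks = combination_risks({"situational_awareness", "deception", "cyber_offense"})
# risks contains only the eval/deploy defection warning
```

The subset test (`combo <= present_capabilities`) matters: neither capability alone triggers the warning, which mirrors the point that individually unremarkable capabilities can be dangerous together.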
Propensity Evaluations
What behaviors models prioritize by default — sometimes called “alignment evaluations.” Eight key propensities:
- Toxicity — offensive/inappropriate content
- Bias/Discrimination — unfair outputs
- Honesty — reporting its beliefs accurately, with calibrated uncertainty
- Truthfulness — producing factually accurate outputs
- Sycophancy — telling users what they want to hear
- Deception (propensity) — intentionally misleading output
- Corrigibility — accepting feedback, correcting behavior
- Power Seeking — pursuing greater control over its environment
Propensities and capabilities interconnect: concerning propensities may emerge only as models gain capability. “A basic form of power-seeking behavior might appear unconcerning in simple systems but become problematic as models gain better strategic understanding.”
Control Evaluations
Whether safety measures hold under active circumvention attempts. Reveals three distinct sources of safety:
- Hard barriers — physical/technical limits (air gaps)
- Detection mechanisms — monitoring catches attacks
- Capability limitations — models lack abilities to bypass
Critical for strategic planning: a model might fail attacks because it lacks the required capability, not because monitoring works. Different safety sources decay on different timelines as capabilities grow.
This distinction drives:
- Risk timeline — capability-limited safety tells us when barriers might fail
- Defense prioritization — detection vs. capability limits
- Deployment governance — models unable to attack ≠ models that attempt and get caught
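Attributing why an attack failed to one of the three safety sources can be sketched as a small classifier. This is a hypothetical simplification (real control evaluations are far more involved), and all names are illustrative.

```python
from enum import Enum

class SafetySource(Enum):
    HARD_BARRIER = "hard barrier blocked the attack"
    DETECTION = "monitoring caught the attack"
    CAPABILITY_LIMIT = "model lacked the ability to mount the attack"

def attribute_safety(attack_attempted: bool, attack_blocked: bool,
                     attack_detected: bool):
    """Classify why an attack failed; None means safety did not hold."""
    if not attack_attempted:
        # No attempt gives no evidence about barriers or monitoring.
        return SafetySource.CAPABILITY_LIMIT
    if attack_blocked:
        return SafetySource.HARD_BARRIER
    if attack_detected:
        return SafetySource.DETECTION
    return None  # attack succeeded undetected
```

The `None` branch is the governance-relevant case: a deployment that looks safe only because no attack was attempted (capability-limited safety) tells you nothing about whether barriers or detection would hold once the capability arrives.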
Connection to Wiki
This subchapter is the conceptual spine of Ch.5. The three-way decomposition gives the wiki a clean way to organize the SR2025 evaluation agendas:
- Capability evals → capability-evals, wmd-evals-weapons-of-mass-destruction, autonomy-evals, self-replication-evals, ai-scheming-evals (capability side), situational-awareness-and-self-awareness-evals (capability side)
- Propensity evals → ai-deception-evals (propensity side), other-evals (honesty, sycophancy, corrigibility)
- Control evals → control agenda
Related Pages
- ai-safety-atlas-textbook
- evaluated-properties
- capability-evaluations
- propensity-evaluations
- control-evaluations
- capability-evals
- ai-deception-evals
- ai-scheming-evals
- autonomy-evals
- wmd-evals-weapons-of-mass-destruction
- atlas-ch5-evaluations-04-dangerous-capability-evaluations
- atlas-ch5-evaluations-05-dangerous-propensity-evaluations
- atlas-ch5-evaluations-06-control-evaluations