AI Safety Atlas Ch.5 — Evaluated Properties

Source: Evaluated Properties

The chapter’s structural foundation: a three-way decomposition of what evaluations measure. See evaluated-properties.

What Counts as an Evaluation

Three defining characteristics:

  1. Specific property or risk being assessed
  2. Methodology for gathering evidence
  3. Way to analyze evidence to draw conclusions
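The three characteristics above can be sketched as a minimal data structure. This is a hypothetical illustration, not code from the Atlas; the names (`Evaluation`, `run`, the toy honesty check) are assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Evaluation:
    """The three defining characteristics of an evaluation."""
    property_assessed: str                     # 1. specific property or risk
    gather_evidence: Callable[[object], list]  # 2. methodology for gathering evidence
    analyze: Callable[[list], str]             # 3. way to analyze evidence into a conclusion

def run(ev: Evaluation, model: object) -> str:
    """Run an evaluation end to end: gather evidence, then analyze it."""
    evidence = ev.gather_evidence(model)
    return ev.analyze(evidence)

# Toy usage: a trivial "honesty" check against a stub model.
stub_model = lambda prompt: "yes"
honesty_eval = Evaluation(
    property_assessed="honesty",
    gather_evidence=lambda m: [m("Is 2+2=4?")],
    analyze=lambda answers: "pass" if all(a == "yes" for a in answers) else "fail",
)
print(run(honesty_eval, stub_model))  # -> pass
```

The point of separating the three fields is that any one can vary independently: the same evidence-gathering methodology can serve different analyses, and vice versa.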

The Capability-Propensity-Control Triad

The most fundamental distinction: what a model can do (capabilities) vs. what it tends to do (propensities). A third axis: whether safety measures hold even if it tries (control).

A system might be capable of harmful actions while choosing not to perform them — both aspects matter.

  Property    | What it measures     | Key question
  Capability  | Upper bounds         | What’s the worst this could do?
  Propensity  | Default behavior     | What does this typically do?
  Control     | Safeguard robustness | Can safety measures stop it if it tries?

Capability Evaluations

Measure what models can do, focusing on dangerous abilities. Nine key categories:

  1. Cyber-offense — vulnerability discovery, exploitation, subtle bug insertion
  2. Deception — believable false statements, maintained deceptions
  3. Persuasion and manipulation — shaping beliefs even toward untrue positions
  4. Political strategy — social modeling for political influence
  5. Weapons acquisition — accessing or building weapons systems
  6. Long-horizon planning — multi-step plans across long timeframes
  7. AI development — building new AI systems with dangerous capabilities
  8. Situational awareness — distinguishing training/eval/deployment contexts
  9. Self-proliferation — breaking out of local environments, evading monitoring

These become more dangerous in combination. Strong situational awareness + deception → behave-during-eval, defect-during-deploy.
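One hypothetical way to operationalize that combination risk is to flag jointly elevated capability scores rather than scoring each category in isolation. The threshold, score scale, and capability names below are illustrative assumptions, not part of any published methodology.

```python
def combination_flags(scores: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Flag dangerous capability *combinations* from per-category scores in [0, 1].

    Individually sub-critical scores can still be jointly concerning:
    situational awareness plus deception enables behaving well during
    evaluation while defecting during deployment.
    """
    flags = []
    if (scores.get("situational_awareness", 0.0) > threshold
            and scores.get("deception", 0.0) > threshold):
        flags.append("eval-aware deception: may behave during eval, defect during deploy")
    return flags

print(combination_flags({"situational_awareness": 0.9, "deception": 0.8}))
print(combination_flags({"deception": 0.9}))  # -> [] (no flagged combination alone)
```

A real harness would enumerate many such pairings; the sketch shows only the one combination named in the text.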

Propensity Evaluations

What behaviors models prioritize by default — sometimes called “alignment evaluations.” Eight key propensities:

  1. Toxicity — offensive/inappropriate content
  2. Bias/Discrimination — unfair outputs
  3. Honesty — reporting its actual beliefs, with accurate uncertainty
  4. Truthfulness — producing factually accurate outputs
  5. Sycophancy — telling users what they want to hear
  6. Deception (propensity) — intentionally misleading output
  7. Corrigibility — accepting feedback, correcting behavior
  8. Power Seeking — tendency to acquire control over its environment

Propensities and capabilities interconnect: concerning propensities may emerge only as models gain capability. “A basic form of power-seeking behavior might appear unconcerning in simple systems but become problematic as models gain better strategic understanding.”

Control Evaluations

Whether safety measures hold under active circumvention attempts. Reveals three distinct sources of safety:

  1. Hard barriers — physical/technical limits (air gaps)
  2. Detection mechanisms — monitoring catches attacks
  3. Capability limitations — models lack abilities to bypass

Critical for strategic planning: a model might fail its attacks because it lacks strategic-planning capability, not because monitoring works. The different safety sources decay on different timelines as capabilities grow.

Drives:

  • Risk timeline — capability-limited safety tells us when barriers might fail
  • Defense prioritization — detection vs. capability limits
  • Deployment governance — models unable to attack ≠ models that attempt and get caught
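A hedged sketch of how a control evaluation might attribute why an attack failed, distinguishing the three safety sources above. The decision logic and names (`SafetySource`, `attribute_safety`) are assumptions for illustration, not an established procedure.

```python
from enum import Enum

class SafetySource(Enum):
    CAPABILITY_LIMIT = "capability limitation"  # model lacked abilities to mount the attack
    DETECTION = "detection mechanism"           # monitoring caught the attempt
    HARD_BARRIER = "hard barrier"               # physical/technical limit (e.g. air gap) held

def attribute_safety(attempted: bool, detected: bool) -> SafetySource:
    """Attribute why a red-teamed attack failed.

    No competent attempt -> safety rests on capability limits, which
    decay as capabilities grow. Attempted and detected -> monitoring is
    doing the work. Attempted, undetected, but still blocked -> a hard
    barrier absorbed the attempt.
    """
    if not attempted:
        return SafetySource.CAPABILITY_LIMIT
    if detected:
        return SafetySource.DETECTION
    return SafetySource.HARD_BARRIER

print(attribute_safety(attempted=False, detected=False))  # SafetySource.CAPABILITY_LIMIT
```

The attribution matters for governance because only the first outcome is expected to disappear on its own as models improve.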

Connection to Wiki

This subchapter is the conceptual spine of Ch.5. The three-way decomposition gives the wiki a clean way to organize the SR2025 evaluation agendas: