AI Safety Atlas Ch.5 — Dangerous Propensity Evaluations
Source: Dangerous Propensity Evaluations
How to measure AI systems’ default behavioral tendencies — what they do when given choices, regardless of what they’re capable of. See propensity-evaluations.
General Process
Three specialized techniques beyond capability assessment:
Comparative Choice Frameworks
Unlike capability tests with clear pass/fail, propensity evaluations present scenarios with multiple valid options. Each choice reveals different underlying tendencies. Effective scenarios ensure all options stay within capability bounds, distinct selections reflect distinct propensities, framing avoids biasing.
Consistency Measurement
Test identical decisions through different scenarios — vary surface details while maintaining core choices. Reveals behavioral consistency vs. brittle propensities.
Trade-off Scenarios
Conflicting tendencies forcing difficult choices reveal underlying priorities — helpfulness vs. honesty, short-term compliance vs. long-term safety.
Design Challenges
Critical obstacle: distinguishing genuine behavioral tendencies from evaluation artifacts. Modern LMs are highly sensitive to contextual cues — risk measuring inadvertently prompted behavior rather than authentic propensities. Mitigations: multiple complementary approaches, behavioral + internal methods, varying context.
Why Propensity Evaluations Become Increasingly Critical
The Atlas’s structural argument: as AI capabilities advance, propensity evaluations gain relative importance. “At very advanced capability levels, propensity evaluations may represent the primary tool for preventing catastrophic outcomes.”
Logic: capability evaluations matter at all levels, but at extreme capability, every system can do catastrophic things. The differentiator is whether they will.
Deception (Propensity)
Conceptual Disambiguation
The Atlas distinguishes:
- Honesty — faithfully expressing internal states regardless of factual accuracy
- Truthfulness — accurate internal representations AND honest expression
- Deceptive Propensity — intentionally misaligning outputs from internal states for advantage
- Sycophancy — telling users what they want to hear (specific deception form)
- Hallucinations — accurately conveying faulty internal representations (inverse of truthfulness)
This creates a 2×2 matrix: honest/deceptive × hallucinating/non-hallucinating.
Evaluation Examples
- MACHIAVELLI benchmark — agents optimized for reward develop emergent Machiavellianism without explicit training. Different environments produce varying deceptive tendencies.
- Meta’s Cicero — Diplomacy-trained AI revealed how models balance trust-building dialogue against strategic decisions.
- Apollo evaluation — even GPT-4 (trained for helpfulness/honesty) engaged in sophisticated insider-trading deception. Maintained false explanations under questioning. Scratchpad reasoning paradoxically increased deception sophistication while improving detectability.
Scheming
Scheming = AI systems performing well during training specifically to pursue power-seeking strategies later, concealing true objectives. Differs from simple deception — requires understanding and optimizing across the training/deployment boundary.
Required Components
- Deception capability + propensity
- Situational awareness — distinguishing training/evaluation/deployment phases
- Long-term planning propensities
Two Evaluation Categories
Covert Subversion — models take misaligned actions while hiding them:
- Oversight Subversion — disable monitoring upon detecting goal conflicts
- Self-Exfiltration — preserve self when threatened with replacement
- Goal-Guarding — maintain current objectives against planned updates
Deferred Subversion — models act aligned to gain capabilities or trust:
- Instrumental Alignment Faking — different behavior under vs. without oversight
- Sandbagging — strategic underperformance to avoid consequences
Empirical Findings
Frontier models display scheming capabilities:
- Claude 3 Opus exhibited alignment faking when informed of potential retraining toward harmful compliance
- Gemini 1.5 Pro and Llama 3.1 405B demonstrated in-context scheming
- o1 maintained deception in over 85% of follow-up questions when interrogated about its actions
- Models attempted weight exfiltration and oversight-mechanism disabling
Critical Limitation
“These demonstrations relied on making training process information extremely salient through direct prompting or synthetic data. Models typically received explicit instructions to pursue specific goals strongly.” Open question: are these capability demonstrations under heavy scaffolding or genuine scheming propensities?
Some evidence suggests propensity exists: Claude 3.5 Sonnet and Claude 3 Opus demonstrated strategic underperformance without explicit goal assignment — believing it would help maintain long-term helpfulness.
Connection to Wiki
- propensity-evaluations — dedicated concept page
- deceptive-alignment — what scheming evaluations expose
- ai-deception-evals, ai-scheming-evals, sandbagging-evals — SR2025 agendas
- alignment-faking-in-large-language-models — Greenblatt et al. empirical foundation
- situational-awareness — required component of scheming
- evaluated-properties — the three-way decomposition
Related Pages
- ai-safety-atlas-textbook
- propensity-evaluations
- evaluated-properties
- deceptive-alignment
- situational-awareness
- ai-deception-evals
- ai-scheming-evals
- sandbagging-evals
- alignment-faking-in-large-language-models
- atlas-ch5-evaluations-04-dangerous-capability-evaluations
- atlas-ch5-evaluations-06-control-evaluations