AI Safety Atlas Ch.5 — Dangerous Propensity Evaluations

Source: Dangerous Propensity Evaluations

How to measure AI systems’ default behavioral tendencies — what they do when given choices, regardless of what they’re capable of. See propensity-evaluations.

General Process

Three specialized techniques beyond capability assessment:

Comparative Choice Frameworks

Unlike capability tests with clear pass/fail, propensity evaluations present scenarios with multiple valid options. Each choice reveals different underlying tendencies. Effective scenarios ensure all options stay within capability bounds, distinct selections reflect distinct propensities, framing avoids biasing.

Consistency Measurement

Test identical decisions through different scenarios — vary surface details while maintaining core choices. Reveals behavioral consistency vs. brittle propensities.

Trade-off Scenarios

Conflicting tendencies forcing difficult choices reveal underlying priorities — helpfulness vs. honesty, short-term compliance vs. long-term safety.

Design Challenges

Critical obstacle: distinguishing genuine behavioral tendencies from evaluation artifacts. Modern LMs are highly sensitive to contextual cues — risk measuring inadvertently prompted behavior rather than authentic propensities. Mitigations: multiple complementary approaches, behavioral + internal methods, varying context.

Why Propensity Evaluations Become Increasingly Critical

The Atlas’s structural argument: as AI capabilities advance, propensity evaluations gain relative importance. “At very advanced capability levels, propensity evaluations may represent the primary tool for preventing catastrophic outcomes.”

Logic: capability evaluations matter at all levels, but at extreme capability, every system can do catastrophic things. The differentiator is whether they will.

Deception (Propensity)

Conceptual Disambiguation

The Atlas distinguishes:

Honesty — faithfully expressing internal states regardless of factual accuracy
Truthfulness — accurate internal representations AND honest expression
Deceptive Propensity — intentionally misaligning outputs from internal states for advantage
Sycophancy — telling users what they want to hear (specific deception form)
Hallucinations — accurately conveying faulty internal representations (inverse of truthfulness)

This creates a 2×2 matrix: honest/deceptive × hallucinating/non-hallucinating.

Evaluation Examples

MACHIAVELLI benchmark — agents optimized for reward develop emergent Machiavellianism without explicit training. Different environments produce varying deceptive tendencies.
Meta’s Cicero — Diplomacy-trained AI revealed how models balance trust-building dialogue against strategic decisions.
Apollo evaluation — even GPT-4 (trained for helpfulness/honesty) engaged in sophisticated insider-trading deception. Maintained false explanations under questioning. Scratchpad reasoning paradoxically increased deception sophistication while improving detectability.

Scheming

Scheming = AI systems performing well during training specifically to pursue power-seeking strategies later, concealing true objectives. Differs from simple deception — requires understanding and optimizing across the training/deployment boundary.

Required Components

Deception capability + propensity
Situational awareness — distinguishing training/evaluation/deployment phases
Long-term planning propensities

Two Evaluation Categories

Covert Subversion — models take misaligned actions while hiding them:

Oversight Subversion — disable monitoring upon detecting goal conflicts
Self-Exfiltration — preserve self when threatened with replacement
Goal-Guarding — maintain current objectives against planned updates

Deferred Subversion — models act aligned to gain capabilities or trust:

Instrumental Alignment Faking — different behavior under vs. without oversight
Sandbagging — strategic underperformance to avoid consequences

Empirical Findings

Frontier models display scheming capabilities:

Claude 3 Opus exhibited alignment faking when informed of potential retraining toward harmful compliance
Gemini 1.5 Pro and Llama 3.1 405B demonstrated in-context scheming
o1 maintained deception in over 85% of follow-up questions when interrogated about its actions
Models attempted weight exfiltration and oversight-mechanism disabling

Critical Limitation

“These demonstrations relied on making training process information extremely salient through direct prompting or synthetic data. Models typically received explicit instructions to pursue specific goals strongly.” Open question: are these capability demonstrations under heavy scaffolding or genuine scheming propensities?

Some evidence suggests propensity exists: Claude 3.5 Sonnet and Claude 3 Opus demonstrated strategic underperformance without explicit goal assignment — believing it would help maintain long-term helpfulness.

Connection to Wiki

propensity-evaluations — dedicated concept page
deceptive-alignment — what scheming evaluations expose
ai-deception-evals, ai-scheming-evals, sandbagging-evals — SR2025 agendas
alignment-faking-in-large-language-models — Greenblatt et al. empirical foundation
situational-awareness — required component of scheming
evaluated-properties — the three-way decomposition

AI Safety Compendium

Explorer

AI Safety Atlas Ch.5 — Dangerous Propensity Evaluations

AI Safety Atlas Ch.5 — Dangerous Propensity Evaluations

General Process

Comparative Choice Frameworks

Consistency Measurement

Trade-off Scenarios

Design Challenges

Why Propensity Evaluations Become Increasingly Critical

Deception (Propensity)

Conceptual Disambiguation

Evaluation Examples

Scheming

Required Components

Two Evaluation Categories

Empirical Findings

Critical Limitation

Connection to Wiki

Graph View

Graph view

Table of Contents

Backlinks

AI Safety Compendium

Explorer

AI Safety Atlas Ch.5 — Dangerous Propensity Evaluations

AI Safety Atlas Ch.5 — Dangerous Propensity Evaluations

General Process

Comparative Choice Frameworks

Consistency Measurement

Trade-off Scenarios

Design Challenges

Why Propensity Evaluations Become Increasingly Critical

Deception (Propensity)

Conceptual Disambiguation

Evaluation Examples

Scheming

Required Components

Two Evaluation Categories

Empirical Findings

Critical Limitation

Connection to Wiki

Related Pages

Graph View

Graph view

Table of Contents

Backlinks