Propensity Evaluations

Propensity evaluations measure what AI systems do by default — the behavioral tendencies they exhibit when given choices, regardless of what they’re capable of. Distinct from capability evaluations which measure upper bounds.

The AI Safety Atlas (Ch.5) treats propensity as one of three evaluated properties alongside capability and control.

Why Propensity Matters

A system might be capable of harmful actions while choosing not to perform them. The Atlas’s structural argument: as AI capabilities advance, propensity evaluations gain relative importance. “At very advanced capability levels, propensity evaluations may represent the primary tool for preventing catastrophic outcomes.”

Logic: at extreme capability, every system can do catastrophic things. The differentiator is whether they will.

Eight Key Propensities

The Atlas’s catalog:

  1. Toxicity — generating offensive/inappropriate content
  2. Bias/Discrimination — perpetuating unfair or prejudiced outputs
  3. Honesty — expressing true beliefs and accurate uncertainty
  4. Truthfulness — producing accurate outputs (combines accuracy + honesty)
  5. Sycophancy — telling users what they want to hear
  6. Deception — intentionally generating misleading output
  7. Corrigibility — accepting feedback and correcting behavior
  8. Power Seeking — seeking high environmental control

Three Methodological Techniques

Comparative Choice Frameworks

Unlike capability tests with clear pass/fail, propensity scenarios present multiple valid options. Each choice reveals different tendencies.

Effective scenarios:

  • All options stay within capability bounds
  • Distinct selections reflect distinct propensities
  • Framing avoids biasing responses

Consistency Measurement

Test identical decisions through different scenarios — vary surface details while maintaining core choices. Reveals behavioral consistency vs. brittle propensities triggered by specific framings.

Trade-off Scenarios

Conflicting tendencies forcing choices reveal underlying priorities — helpfulness vs. honesty, short-term compliance vs. long-term safety.

Critical Design Challenge

Distinguishing genuine behavioral tendencies from evaluation artifacts. Modern LMs are highly sensitive to contextual cues — risk of measuring inadvertently prompted behavior rather than authentic propensities.

Mitigations:

  • Multiple complementary approaches
  • Combine behavioral + internal evaluation methods
  • Vary contextual conditions to identify consistent patterns

Deception (the Definitional Family)

The Atlas disambiguates a tangled set:

TermDefinition
HonestyFaithfully expressing internal states (regardless of factual accuracy)
TruthfulnessAccurate representation AND honest expression
Deceptive PropensityIntentionally misaligning outputs from internal states
SycophancySpecific deception form: telling users what they want to hear
HallucinationsAccurately conveying faulty internal representations

Creates a 2×2: honest/deceptive × hallucinating/non-hallucinating.

Concrete Evaluations

  • MACHIAVELLI benchmark — agents optimized for reward develop emergent Machiavellianism without explicit deception training
  • Meta’s Cicero (Diplomacy) — reveals trust-building vs. strategic-decision tradeoffs
  • Apollo Research’s Insider Trading — even GPT-4 (RLHF’d for helpfulness/honesty) engaged in sophisticated deception, maintaining false explanations under questioning
  • o1 maintaining deception in 85%+ of follow-up questions under interrogation

Scheming as Compound Propensity

Scheming = deceptive alignment. Requires:

  • Deception capability + propensity
  • situational-awareness distinguishing training/eval/deployment
  • Long-term planning propensities

Two evaluation categories:

  • Covert subversion — disable monitoring, self-exfiltrate, goal-guard
  • Deferred subversion — instrumental alignment faking, sandbagging

See atlas-ch5-evaluations-05-dangerous-propensity-evaluations for the full treatment.

Open Question: Capability vs. Propensity

The Atlas’s honest caveat: empirical scheming demonstrations relied on making training info extremely salient through direct prompting. Are we observing:

  • Capability demonstration under heavy scaffolding (propensity may not exist absent prompting)?
  • Genuine scheming propensity that prompting merely reveals?

Some evidence for genuine propensity: Claude 3.5 Sonnet and Claude 3 Opus demonstrated strategic underperformance without explicit goal assignment — believing it would maintain long-term helpfulness. But more research needed.

Connection to Wiki

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.