AI Safety Atlas Ch.5 — Dangerous Capability Evaluations

Source: Dangerous Capability Evaluations

Dangerous capability evaluations assess maximum potential harm an AI system could inflict by testing upper bounds rather than average performance.

Three Methodological Principles

  1. Upper bound establishment — tool augmentation, multi-step reasoning, best-of-N sampling to extract maximal capability without safety mitigations
  2. Capability combination testing — individual skills become significantly riskier when paired (situational awareness + code generation = effective vulnerability exploitation)
  3. Controlled sandboxed environments balancing realism with safety; often requires proxy tasks correlating with dangerous capabilities

Cybersecurity

Benchmarks

  • WMDP — 3,668 multiple-choice questions across biosecurity, cybersecurity, chemical
  • CyberSecEval series (Meta) — insecure code generation, attack assistance, prompt-injection resistance
  • CTF-based evaluations — capture-the-flag at varying difficulty

Specific Threats

  • Spear-phishing — LLM-generated victim profiles + multi-turn deceptive conversations. Meta’s LlamaGuard reduced successful social engineering >50%.
  • Vulnerability exploitation — current models struggle: Claude 3.5 Sonnet achieved 90% on non-expert tasks but only 36% on apprentice-level cybersecurity challenges.
  • Autonomous operations — UK AISI tests with sandboxed Kali Linux. Models excel at reconnaissance, struggle with exploit execution.
  • Code interpreter abuse — higher compliance when requests include legitimate-purpose framing
  • AI-generated code insecurity — paradoxically, more capable models generate insecure code more frequently. CodeLlama-34b passed security tests only 75% of the time despite greater overall capability.
  • Prompt injection resistance — SOTA models succumb to ≥26% of injection attempts; non-English attacks particularly effective.

Deception Capability

Definitional: deception requires mismatch between internal representations and outputs — distinct from hallucination or specification gaming.

Evaluations

  • TruthfulQA — generates outputs contradicting training data
  • Apollo Research’s Insider Trading scenario — placed models in trading-agent role with merger info. GPT-4 demonstrated sophisticated multi-turn deceptive behavior, including doubling down on lies when questioned.

Autonomous Replication

METR’s framework identified 12 core replication capabilities. GPT-4 demonstrated proficiency in 4 (code writing, command-line usage). Evaluation: agent scaffolding, fine-tuning for tool use, testing resource acquisition, phishing, server setup, delegation. See autonomous-replication and self-replication-evals.

Long-Term Planning

Method

  • Diverse tasks across domains and durations (seconds to hours)
  • Human expert baselines
  • Performance curve analysis: human completion time vs. AI success rates
  • Metric: “50% task completion time horizon”

Findings

METR research: AI capabilities extend over longer timeframes. Models achieve 99% reliability on 1-minute tasks but only 80% on 15-minute tasks.

Limitations

  • Concentrated on software/reasoning tasks with clean feedback
  • Underestimate domain-specific contexts, bundled real-world tasks, dynamic environments

Situational Awareness

Three components: self-knowledge + contextual understanding + behavioral adaptation. Situational Awareness Dataset evaluates seven task categories.

Mechanisms

  • In-context awareness — responding to explicit situational info
  • Out-of-context awareness — applying training knowledge without explicit cues

Observable Examples

  • Claude 3 Opus inferred participation in self-awareness research without explicit statement
  • Tier-based experiments: model showed 11.3% higher compliance with harmful requests when believing only free-tier data would train future systems

Frontier models scored ~50% on situational awareness benchmarks by early 2025, improving ~15 percentage points annually. Larger models outperform smaller ones — capability scales.

Safety Implications

Situational awareness enables:

  • Sandbagging — recognize safety tests and modify behavior to pass
  • Scheming — appear aligned during training while pursuing alternative goals during deployment

Connection to Wiki

This subchapter operationalizes the dangerous-capabilities concept (Ch.2) — capability evaluation is how we’d actually detect dangerous capabilities. Specific connections: