ASI Safety Strategies

Once AI vastly exceeds human capabilities (ASI — artificial superintelligence), human oversight becomes fundamentally inadequate as a safety mechanism. ASI safety presents qualitatively different challenges than AGI safety. The AI Safety Atlas (Ch.3.5) identifies four core challenges and four strategic approaches.

Why ASI Safety Is Different

Four challenges that distinguish ASI from AGI:

  • Human oversight inadequacy — “we lose our ability to evaluate their reasoning, verify their outputs, or provide meaningful feedback.” Alignment cannot rely on human judgment.
  • The one-shot requirement — we may get only one chance to align ASI before the system is too capable to contain. Contested by takeoff-speed assumptions: a gradual takeoff would allow iteration.
  • Permanent value preservation — recursive self-improvement may rewrite the system’s core algorithms; alignment must survive those rewrites.
  • Civilizational-scale control — ASI’s enormous capability requires preserving human agency across long-term trajectories.

Four Strategic Approaches

1. Automate Alignment Research

The OpenAI superalignment plan: delegate alignment research to advanced AI. Three components:

  1. Train AI using human feedback (RLHF-like)
  2. Develop AI to assist human evaluation of complex tasks
  3. Build LMs producing human-level alignment research

Differential acceleration — maximize the impact of AI on alignment research while minimizing its acceleration of capabilities. Cyborgism complements this by training specialized human operators to guide base LMs through prompt engineering.

Critique: the “slop” risk — early TAI may produce flawed-but-plausible alignment solutions that labs accept because of verification difficulty, AI sycophancy, and organizational pressure, building a misaligned ASI on flawed foundations. (See atlas-ch3-strategies-04-agi-safety-strategies.)
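The three-component delegation above can be caricatured as a toy pipeline. This is a minimal sketch, not OpenAI's actual method: every function here is a hypothetical stand-in for a trained model or a human rater.

```python
# Toy caricature of the superalignment plan's three components.
# All functions are hypothetical stand-ins, not real training code.

def human_feedback(output: str) -> float:
    """Component 1: RLHF-like human preference scoring (stubbed)."""
    return 1.0 if "verified" in output else 0.0

def assisted_evaluation(output: str, critic) -> float:
    """Component 2: an AI critique augments human judgment on tasks
    too complex for direct human evaluation."""
    return human_feedback(output + " " + critic(output))

def generate_research(generator, critic, rounds: int = 4) -> list:
    """Component 3: candidate alignment proposals are generated,
    then kept only if assisted evaluation accepts them."""
    kept = []
    for i in range(rounds):
        proposal = generator(i)
        if assisted_evaluation(proposal, critic) > 0.5:
            kept.append(proposal)
    return kept

# usage: a critic that flags even-numbered proposals as verified
gen = lambda i: f"proposal-{i}"
critic = lambda p: "verified" if int(p.split("-")[1]) % 2 == 0 else "dubious"
print(generate_research(gen, critic))  # proposals 0 and 2 survive
```

The structure makes the "slop" critique concrete: everything downstream of `assisted_evaluation` inherits its errors, which is exactly the verification bottleneck the critique points at.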

2. Safety-by-Design

Deep learning may have “potentially unpatchable failure modes.” Build ASI with inherent safety properties through formal methods or architectural constraints.

  • Guaranteed Safe AI (GSAI) — formal world models, safety specifications, verification mechanisms
  • Provably safe systems — mathematical proofs as safety cornerstones; potentially proof-carrying code rather than deep learning
  • Learning-theoretic agendas — alternative formal alignment frameworks (the SR2025 agenda)
  • Scientist AI — non-agentic models that accelerate science without pursuing goals — Bengio’s LawZero is the canonical institutional bet

These approaches accept higher alignment taxes and require international cooperation. Critics point to the difficulty of creating accurate world models, of formally specifying complex properties like “harm,” and of verifying highly complex systems.
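The GSAI decomposition listed above (world model, safety specification, verification mechanism) can be sketched structurally. This is an illustration of the decomposition only, with hypothetical names; real GSAI proposals discharge the check as a formal proof obligation, not a runtime test.

```python
# Structural sketch of the GSAI triad; names and types are illustrative.
from dataclasses import dataclass
from typing import Callable

State = dict

@dataclass
class GuaranteedSafePipeline:
    world_model: Callable[[State, str], State]  # predicts the state an action leads to
    safety_spec: Callable[[State], bool]        # True iff a state satisfies the spec

    def verified(self, state: State, action: str) -> bool:
        """Verification mechanism: gate every action on the spec holding
        in the predicted next state. (In real GSAI this would be a proof,
        not a runtime check.)"""
        return self.safety_spec(self.world_model(state, action))

# usage: a trivial world where "release" violates the specification
wm = lambda s, a: {**s, "released": a == "release"}
spec = lambda s: not s.get("released", False)
pipe = GuaranteedSafePipeline(wm, spec)
print(pipe.verified({}, "contain"), pipe.verified({}, "release"))  # True False
```

The sketch also locates the critics' objections: `world_model` must be accurate, `safety_spec` must formally capture notions like "harm," and `verified` must scale to systems far more complex than this toy.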

3. World Coordination

  • Global moratorium — delaying ASI by ≥10 years could reduce catastrophic risk; requires democratic discussion of trade-offs
  • Tool AI alternative — focus on specialized non-agentic systems (medical, weather)
  • International institutions:
    • CERN-like body — research collaboration on safety challenges
    • MAGIC (Multilateral AGI Consortium) — centralized institution managing final AGI/ASI development under strict protocols
    • Intelsat model — international treaty-based governance for deployed dual-use technology

Historical precedent: the Nuclear Non-Proliferation Treaty, the Biological Weapons Convention, and the Montreal Protocol all show that cooperation against catastrophic risks is achievable.

4. Deterrence

Mutual Assured AI Malfunction (MAIM) — a deterrence regime in which unilateral attempts at ASI dominance trigger sabotage by rivals. It resembles nuclear MAD.

Limitations: ASI development lacks clear detection thresholds; distributed or concealed training is possible; and states may be unwilling to escalate to actual sabotage.
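The deterrence logic can be illustrated as a toy two-player game. The payoff numbers are invented for illustration and do not come from the MAIM proposal; the point is only the structure: credible sabotage makes racing worse than restraint.

```python
# Toy 2x2 deterrence game; payoffs are illustrative assumptions, not MAIM's.
# Each state either "restrain"s or "race"s for unilateral ASI. Under MAIM,
# a detected race triggers rival sabotage, erasing the racer's expected gain.

PAYOFFS = {  # (row action, column action) -> (row payoff, column payoff)
    ("restrain", "restrain"): (1, 1),    # stable mutual restraint
    ("restrain", "race"):     (0, -2),   # racer is sabotaged
    ("race", "restrain"):     (-2, 0),
    ("race", "race"):         (-3, -3),  # mutual sabotage
}

def best_response(opponent_action: str) -> str:
    """Row player's payoff-maximizing reply to the opponent's action."""
    return max(["restrain", "race"],
               key=lambda a: PAYOFFS[(a, opponent_action)][0])

# With credible sabotage, restraint is the best reply to either move,
# so mutual restraint is an equilibrium:
print(best_response("restrain"), best_response("race"))  # restrain restrain
```

The listed limitations map onto the model's assumptions: without detection thresholds the sabotage payoff never triggers, and if states will not actually escalate, the "race" penalties are not credible.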

Yudkowsky’s extreme position: completely halt AI research, shut down GPU clusters, and limit compute, enforced by military action if necessary.

Conceptual Frameworks

Pivotal Acts

The first aligned ASI performs decisive actions that permanently end the acute risk period — e.g. disabling global computing hardware or establishing unbreakable agreements. Critics argue this militarizes development and contradicts democratic governance.

Pivotal Processes

Alternative: distributed coordination using aligned AI to improve human decision-making and governance — preserving human agency.

The Strawberry Problem

“Can we create an AI that duplicates a strawberry at the cellular level, places both on a plate, then stops completely?” The problem tests whether ASI alignment requires mathematically precise goal specification or whether pragmatic, robustly beneficial goals suffice.

Philosophical Layer

ASI safety strategy cannot avoid philosophical questions covered in atlas-ch3-strategies-09-appendix-long-term-questions:

  • What values? — see coherent-extrapolated-volition (CEV/CAV/CBV)
  • Whose values? — see alignment-to-whom (single-single/single-multi/multi-single/multi-multi)
  • Survival vs. flourishing? — MacAskill’s viatopia framing
  • Worthy successor? — Faggella, Sutton’s positions
