AI Safety Atlas Ch.3 — ASI Safety Strategies

Source: ASI Safety Strategies

Once AI vastly exceeds human capabilities, human oversight becomes fundamentally inadequate. ASI safety therefore presents qualitatively different challenges from AGI safety. See asi-safety-strategies.

Four Core Challenges

  • Human oversight inadequacy — “we lose our ability to evaluate their reasoning, verify their outputs, or provide meaningful feedback.” Alignment cannot rely on human judgment as a safety mechanism.
  • The one-shot requirement — we may get only one chance before the system is too capable to contain. Contested by takeoff-speed assumptions: gradual development would allow iteration.
  • Permanent value preservation — superintelligent systems may recursively self-improve, rewriting their core algorithms. Alignment must survive modification by an intelligence vastly exceeding the human level.
  • Civilizational-scale control — ASI’s enormous capability requires preserving human agency across long-term trajectories.

Four Strategic Approaches

1. Automate Alignment Research

OpenAI’s Superalignment plan: delegate alignment research to advanced AI. Three components:

  1. Train AI using human feedback (RLHF-like)
  2. Develop AI to assist human evaluation of complex tasks
  3. Build language models that produce human-level alignment research

Relies on differential acceleration — maximizing impact on alignment research while minimizing acceleration of capabilities. Cyborgism complements this by training specialized humans to guide base LMs.
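
To make component 2 concrete, below is a minimal, purely illustrative sketch of AI-assisted evaluation: a critique model helps a human rater judge an answer they could not reliably evaluate unaided. The function names (assistant_critique, human_judgement) are hypothetical placeholders, not part of OpenAI’s plan or any real API.

```python
"""Toy sketch of AI-assisted evaluation (scalable oversight), with stand-in functions."""

from dataclasses import dataclass


@dataclass
class Evaluation:
    answer: str
    critique: str
    approved: bool


def assistant_critique(task: str, answer: str) -> str:
    """Placeholder for a model that surfaces flaws a human rater might miss."""
    return f"Check whether the answer to {task!r} cites verifiable evidence."


def human_judgement(answer: str, critique: str) -> bool:
    """Placeholder for a human rater whose judgment is aided by the critique."""
    return "evidence" in critique and len(answer) > 0


def evaluate(task: str, answer: str) -> Evaluation:
    """The human evaluates the answer with AI assistance rather than unaided."""
    critique = assistant_critique(task, answer)
    return Evaluation(answer, critique, human_judgement(answer, critique))


if __name__ == "__main__":
    print(evaluate("summarize the alignment paper", "The paper argues..."))
```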

Criticisms: blind spots in the plan; difficulty verifying AI-proposed alignment solutions (the “slop” risk from Ch.3.4).

This connects directly to the wiki’s existing superalignment page.

2. Safety-by-Design

Deep learning may have “potentially unpatchable failure modes,” so this approach builds ASI with inherent safety properties through formal methods or architectural constraints.

  • Guaranteed Safe AI (GSAI) — formal world models, safety specifications, verification
  • Provably safe systems — mathematical proofs as safety cornerstone (proof-carrying code, not deep learning)
  • Learning-theoretic agendas — alternative formal alignment frameworks

These approaches accept higher alignment taxes and require international cooperation. Critics point to the difficulty of creating accurate world models, formally specifying complex properties like “harm,” and verifying highly complex systems. Maps to the SR2025 guaranteed-safe-ai agenda; Bengio’s LawZero is the canonical institutional bet.
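
To make the “provably safe” idea concrete, here is a toy Lean sketch of a formal safety specification and a machine-checked proof that a constrained policy satisfies it. The predicate, the budget, and the policy are invented for illustration and are vastly simpler than anything a real GSAI specification would require.

```lean
-- Toy safety specification: an emitted action never exceeds a fixed budget.
def safe (power : Nat) : Prop := power ≤ 100

-- A policy constrained by construction to respect the budget.
def clippedPolicy (requested : Nat) : Nat :=
  if requested ≤ 100 then requested else 100

-- Machine-checked proof that the policy satisfies the specification
-- for every possible input, not just inputs seen in testing.
theorem clippedPolicy_safe (requested : Nat) : safe (clippedPolicy requested) := by
  unfold safe clippedPolicy
  split <;> omega
```

The design point is that the guarantee is established by proof over every possible input rather than by testing; GSAI proposals aim to scale this style of guarantee to far richer world models and safety specifications.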

3. World Coordination

  • Global moratorium — delaying ASI by ≥10 years could reduce catastrophic risk; requires democratic discussion of trade-offs
  • Tool AI alternative — focus on specialized non-agentic systems
  • International institutions:
    • CERN-like body — research collaboration on safety challenges and pre-AGI development
    • MAGIC (Multilateral AGI Consortium) — centralized institution managing final AGI/ASI development under strict protocols
    • Intelsat model — international treaty-based governance for deployed dual-use technology

Historical precedent: the Nuclear Non-Proliferation Treaty, Biological Weapons Convention, Montreal Protocol, and human cloning restrictions all show that cooperation on catastrophic risks is achievable.

4. Deterrence

Mutual Assured AI Malfunction (MAIM) — a deterrence regime in which unilateral attempts at ASI dominance trigger sabotage by rivals. It resembles nuclear MAD and creates incentives that align national interests with global safety without requiring perfect cooperation. See mutual-assured-ai-malfunction.

Limitations of MAIM:

  • ASI development lacks clear detection thresholds (unlike nuclear)
  • Distributed/concealed training possible
  • Historical precedent: nations rarely escalate far enough to enforce such treaties

Yudkowsky’s extreme position: completely halt AI research, shut down GPU clusters, and limit compute, enforced by military action if necessary. See eliezer-yudkowsky.

Conceptual Frameworks

Pivotal Acts

The first aligned superintelligence performs decisive actions that permanently end the acute risk period, such as disabling global computing infrastructure or establishing unbreakable agreements. Critics argue this militarizes development and contradicts democratic governance. See pivotal-act.

Pivotal Processes

Alternative: distributed coordination that uses aligned AI to improve human decision-making, demonstrate risks, and develop better governance — preserving human agency throughout.

The Strawberry Problem

A test of precise superintelligent control: “Can we create an AI that duplicates a strawberry at the cellular level, places both on a plate, then stops completely?” It reveals debates about whether ASI alignment requires mathematically precise specification or pragmatic, robustly beneficial goals.

Connection to Wiki

This subchapter substantially deepens: