AI Safety Atlas Ch.3 — ASI Safety Strategies

Source: ASI Safety Strategies

Once AI vastly exceeds human capabilities, human oversight becomes fundamentally inadequate. ASI safety therefore presents qualitatively different challenges from AGI safety. See asi-safety-strategies.

Four Core Challenges

  • Human oversight inadequacy — “we lose our ability to evaluate their reasoning, verify their outputs, or provide meaningful feedback.” Alignment cannot rely on human judgment as a safety mechanism.
  • The one-shot requirement — we may get only one chance before the system is too capable to contain. Contested by takeoff-speed assumptions: gradual development would allow iteration.
  • Permanent value preservation — superintelligent systems may recursively self-improve, rewriting their core algorithms. Alignment must survive modification by an intelligence vastly exceeding the human level.
  • Civilizational-scale control — ASI’s enormous capability requires preserving human agency across long-term trajectories.

Four Strategic Approaches

1. Automate Alignment Research

OpenAI’s Superalignment plan: delegate alignment research to advanced AI. Three components:

  1. Train AI using human feedback (RLHF-like)
  2. Develop AI to assist human evaluation of complex tasks
  3. Build language models that produce human-level alignment research

Relies on differential acceleration — maximizing impact on alignment research while minimizing acceleration of capabilities. Cyborgism complements this by training specialized humans to guide base LMs.
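
To make component 2 concrete, below is a minimal, purely illustrative sketch of AI-assisted evaluation: a critique model helps a human rater judge an answer they could not reliably evaluate unaided. The function names (assistant_critique, human_judgement) are hypothetical placeholders, not part of OpenAI’s plan or any real API.

```python
"""Toy sketch of AI-assisted evaluation (scalable oversight), with stand-in functions."""

from dataclasses import dataclass


@dataclass
class Evaluation:
    answer: str
    critique: str
    approved: bool


def assistant_critique(task: str, answer: str) -> str:
    """Placeholder for a model that surfaces flaws a human rater might miss."""
    return f"Check whether the answer to {task!r} cites verifiable evidence."


def human_judgement(answer: str, critique: str) -> bool:
    """Placeholder for a human rater whose judgment is aided by the critique."""
    return "evidence" in critique and len(answer) > 0


def evaluate(task: str, answer: str) -> Evaluation:
    """The human evaluates the answer with AI assistance rather than unaided."""
    critique = assistant_critique(task, answer)
    return Evaluation(answer, critique, human_judgement(answer, critique))


if __name__ == "__main__":
    print(evaluate("summarize the alignment paper", "The paper argues..."))
```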

Criticisms: blind spots in the plan; difficulty verifying AI-proposed alignment solutions (the “slop” risk from Ch.3.4).

This connects directly to the wiki’s existing superalignment page.

2. Safety-by-Design

Deep learning may have “potentially unpatchable failure modes,” so this approach builds ASI with inherent safety properties through formal methods or architectural constraints.

  • Guaranteed Safe AI (GSAI) — formal world models, safety specifications, verification
  • Provably safe systems — mathematical proofs as safety cornerstone (proof-carrying code, not deep learning)
  • Learning-theoretic agendas — alternative formal alignment frameworks

These approaches accept higher alignment taxes and require international cooperation. Critics point to the difficulty of creating accurate world models, formally specifying complex properties like “harm,” and verifying highly complex systems. Maps to the SR2025 guaranteed-safe-ai agenda; Bengio’s LawZero is the canonical institutional bet.
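
To make the “provably safe” idea concrete, here is a toy Lean sketch of a formal safety specification and a machine-checked proof that a constrained policy satisfies it. The predicate, the budget, and the policy are invented for illustration and are vastly simpler than anything a real GSAI specification would require.

```lean
-- Toy safety specification: an emitted action never exceeds a fixed budget.
def safe (power : Nat) : Prop := power ≤ 100

-- A policy constrained by construction to respect the budget.
def clippedPolicy (requested : Nat) : Nat :=
  if requested ≤ 100 then requested else 100

-- Machine-checked proof that the policy satisfies the specification
-- for every possible input, not just inputs seen in testing.
theorem clippedPolicy_safe (requested : Nat) : safe (clippedPolicy requested) := by
  unfold safe clippedPolicy
  split <;> omega
```

The design point is that the guarantee is established by proof over every possible input rather than by testing; GSAI proposals aim to scale this style of guarantee to far richer world models and safety specifications.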

3. World Coordination

  • Global moratorium — delaying ASI by ≥10 years could reduce catastrophic risk; requires democratic discussion of trade-offs
  • Tool AI alternative — focus on specialized non-agentic systems
  • International institutions:
    • CERN-like body — research collaboration on safety challenges and pre-AGI development
    • MAGIC (Multilateral AGI Consortium) — centralized institution managing final AGI/ASI development under strict protocols
    • Intelsat model — international treaty-based governance for deployed dual-use technology

Historical precedent: the Nuclear Non-Proliferation Treaty, Biological Weapons Convention, Montreal Protocol, and human cloning restrictions all show that cooperation on catastrophic risks is achievable.

4. Deterrence

Mutual Assured AI Malfunction (MAIM) — a deterrence regime in which unilateral attempts at ASI dominance trigger sabotage by rivals. It resembles nuclear MAD and creates incentives that align national interests with global safety without requiring perfect cooperation. See mutual-assured-ai-malfunction.

Limitations of MAIM:

  • ASI development lacks clear detection thresholds (unlike nuclear)
  • Distributed/concealed training possible
  • Historical precedent: nations rarely escalate far enough to enforce such treaties

Yudkowsky’s extreme position: completely halt AI research, shut down GPU clusters, and limit compute, enforced by military action if necessary. See eliezer-yudkowsky.

Conceptual Frameworks

Pivotal Acts

The first aligned superintelligence performs decisive actions that permanently end the acute risk period, such as disabling global computing infrastructure or establishing unbreakable agreements. Critics argue this militarizes development and contradicts democratic governance. See pivotal-act.

Pivotal Processes

Alternative: distributed coordination that uses aligned AI to improve human decision-making, demonstrate risks, and develop better governance — preserving human agency throughout.

The Strawberry Problem

A test of precise superintelligent control: “Can we create an AI that duplicates a strawberry at the cellular level, places both on a plate, then stops completely?” It reveals debates about whether ASI alignment requires mathematically precise specification or pragmatic, robustly beneficial goals.

Connection to Wiki

This subchapter substantially deepens: