AI Safety Atlas Ch.3 — ASI Safety Strategies
Source: ASI Safety Strategies
Once AI vastly exceeds human capabilities, human oversight becomes fundamentally inadequate. ASI safety presents qualitatively different challenges from AGI safety. See asi-safety-strategies.
Four Core Challenges
- Human oversight inadequacy — “we lose our ability to evaluate their reasoning, verify their outputs, or provide meaningful feedback.” Alignment cannot rely on human judgment as a safety mechanism.
- The one-shot requirement — we may get only one chance before the system becomes too capable to contain. Contested by takeoff-speed assumptions: a gradual takeoff would allow iteration.
- Permanent value preservation — superintelligent systems may recursively self-improve, rewriting their core algorithms. Alignment must survive modification by an intelligence vastly exceeding human level.
- Civilizational-scale control — ASI’s enormous capability raises the problem of preserving human agency across long-term civilizational trajectories.
Four Strategic Approaches
1. Automate Alignment Research
OpenAI’s Superalignment plan: delegate alignment research to advanced AI. Three components:
- Train AI using human feedback (RLHF-like)
- Develop AI to assist human evaluation of complex tasks
- Build LMs producing human-level alignment research
The plan relies on differential acceleration: maximizing the impact of alignment research while minimizing capability acceleration. Cyborgism complements this by training specialized humans to guide base LMs.
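A minimal sketch of the second component, AI-assisted evaluation (scalable oversight), assuming hypothetical stand-in functions; assistant_critique, human_judgement, and Critique are illustrative names, not OpenAI's API or implementation.

```python
# Toy sketch of the AI-assisted-evaluation component (scalable oversight).
# All names here are illustrative stand-ins, not OpenAI's actual system.
from dataclasses import dataclass


@dataclass
class Critique:
    claim: str       # specific point the assistant flags in the answer
    severity: float  # 0.0 (minor) to 1.0 (critical), as judged by the assistant


def assistant_critique(task: str, answer: str) -> list[Critique]:
    """Stand-in for a model that decomposes a hard-to-judge answer into
    smaller claims a human overseer can actually evaluate."""
    return [Critique(claim=f"A step in the answer to '{task}' may be unsupported",
                     severity=0.4)]


def human_judgement(critiques: list[Critique]) -> float:
    """Stand-in for a human overseer who grades the critiques rather than
    the full answer; returns a reward in [0, 1]."""
    if not critiques:
        return 1.0
    return max(0.0, 1.0 - sum(c.severity for c in critiques) / len(critiques))


def oversight_step(task: str, answer: str) -> float:
    """One loop of AI-assisted evaluation: critique the answer, then convert
    the human's assessment into a reward an RLHF-style update could use."""
    return human_judgement(assistant_critique(task, answer))


if __name__ == "__main__":
    print(oversight_step("prove the lemma", "a sketchy three-line proof"))  # 0.6
```

The point of the loop is that the overseer's job is reduced to grading critiques rather than full outputs.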
Criticisms: the plan has blind spots, and AI-proposed alignment solutions are hard to verify (the “slop” risk from Ch. 3.4).
This connects directly to the wiki’s existing superalignment page.
2. Safety-by-Design
Because deep learning may have “potentially unpatchable failure modes,” this approach builds ASI with inherent safety properties through formal methods or architectural constraints.
- Guaranteed Safe AI (GSAI) — formal world models, safety specifications, verification
- Provably safe systems — mathematical proofs as safety cornerstone (proof-carrying code, not deep learning)
- Learning-theoretic agendas — alternative formal alignment frameworks
These approaches accept higher alignment taxes and require international cooperation. Critics point to the difficulty of creating accurate world models, formally specifying complex properties like “harm,” and verifying highly complex systems. Maps to the SR2025 guaranteed-safe-ai agenda. Bengio’s LawZero is the canonical institutional bet.
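A toy sketch of the GSAI pattern named above (formal world model, safety specification, verification gate), assuming a made-up single-variable domain; State, world_model, and safety_spec are illustrative names, not an existing GSAI codebase.

```python
# Toy sketch of the GSAI pattern: a world model predicts the consequences of a
# proposed action, a formal safety specification is checked on the prediction,
# and only spec-satisfying actions are executed. The domain and names here are
# invented for illustration.
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    reactor_temp_c: float  # single state variable in a made-up toy domain


def world_model(state: State, action: str) -> State:
    """Predicts the next state under a proposed action (toy dynamics)."""
    delta = {"heat": 50.0, "cool": -50.0, "hold": 0.0}.get(action, 0.0)
    return State(reactor_temp_c=state.reactor_temp_c + delta)


def safety_spec(state: State) -> bool:
    """Formal safety property: the predicted state must stay inside a verified
    envelope (here, a simple temperature bound)."""
    return state.reactor_temp_c <= 400.0


def gated_execute(state: State, action: str) -> State:
    """Run an action only if its predicted consequences satisfy the spec;
    otherwise fall back to a known-safe default."""
    predicted = world_model(state, action)
    return predicted if safety_spec(predicted) else world_model(state, "hold")


if __name__ == "__main__":
    s = State(reactor_temp_c=390.0)
    print(gated_execute(s, "heat"))  # predicted 440 > 400: rejected, holds at 390
    print(gated_execute(s, "cool"))  # predicted 340: allowed
```

The hard parts the critics cite live inside world_model and safety_spec: building an accurate model of the world and formally stating properties like “harm” at civilizational scale.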
3. World Coordination
- Global moratorium — delaying ASI by ≥10 years could reduce catastrophic risk; requires democratic discussion of trade-offs
- Tool AI alternative — focus on specialized non-agentic systems
- International institutions:
- CERN-like body — research collaboration on safety challenges and pre-AGI development
- MAGIC (Multilateral AGI Consortium) — centralized institution managing final AGI/ASI development under strict protocols
- Intelsat model — international treaty-based governance for deployed dual-use technology
Historical precedents: the Nuclear Non-Proliferation Treaty, the Biological Weapons Convention, the Montreal Protocol, and restrictions on human cloning all show that cooperation on catastrophic risks is achievable.
4. Deterrence
Mutual Assured AI Malfunction (MAIM) — a deterrence regime in which unilateral attempts at ASI dominance trigger sabotage by rivals. It resembles nuclear MAD and creates incentives that align national interests with global safety without requiring perfect cooperation. See mutual-assured-ai-malfunction.
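A toy payoff model of the deterrence logic, under the assumption that a detected unilateral race is reliably sabotaged; the payoff numbers are invented for illustration and are not taken from the MAIM proposal.

```python
# Toy two-player game illustrating the MAIM deterrence logic: if a detected
# unilateral ASI sprint reliably triggers sabotage, racing stops paying off.
# The payoff numbers are invented for illustration only.
from itertools import product

ACTIONS = ("restrain", "race")

# (row player's payoff, column player's payoff), assuming a race is detected
# and sabotaged by the other side.
PAYOFFS = {
    ("restrain", "restrain"): (3, 3),  # stable status quo
    ("restrain", "race"):     (1, 1),  # racer is sabotaged; both lose value
    ("race",     "restrain"): (1, 1),  # symmetric case
    ("race",     "race"):     (0, 0),  # mutual sabotage / escalation
}


def best_response(opponent_action: str, player: int) -> str:
    """Action that maximizes the given player's payoff against a fixed opponent action."""
    def payoff(a: str) -> float:
        key = (a, opponent_action) if player == 0 else (opponent_action, a)
        return PAYOFFS[key][player]
    return max(ACTIONS, key=payoff)


if __name__ == "__main__":
    # Print the mutual best responses (pure-strategy Nash equilibria) of the toy game.
    for a, b in product(ACTIONS, repeat=2):
        if best_response(b, 0) == a and best_response(a, 1) == b:
            print("equilibrium:", a, b, "payoffs:", PAYOFFS[(a, b)])
```

With these numbers, mutual restraint is the only pure-strategy equilibrium, which is the incentive structure MAIM aims to create; the limitations below question whether the sabotage threat is credible enough to hold the payoffs in this shape.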
Limitations of MAIM:
- ASI development lacks clear detection thresholds (unlike nuclear programs)
- Training can be distributed or concealed
- Historically, nations rarely escalate far enough to enforce such treaties
Yudkowsky’s extreme position: halt AI research entirely, shut down GPU clusters, and limit compute, enforced by military action if necessary. See eliezer-yudkowsky.
Conceptual Frameworks
Pivotal Acts
The first aligned superintelligence performs decisive actions that permanently end the acute risk period, such as disabling global computing infrastructure or establishing unbreakable agreements. Critics argue this militarizes development and contradicts democratic governance. See pivotal-act.
Pivotal Processes
Alternative: distributed coordination that uses aligned AI to improve human decision-making, demonstrate risks, and develop better governance, preserving human agency.
The Strawberry Problem
A test of precise superintelligent control: “Can we create an AI duplicating a strawberry at the cellular level, placing both on a plate, then stopping completely?” It surfaces the debate over whether ASI alignment requires mathematically precise specification or pragmatic, robustly beneficial goals.
Connection to Wiki
This subchapter substantially deepens:
- superalignment — the automate-alignment-research strategy
- differential-development — safety-by-design and ASI-relevant differential acceleration
- ai-governance — the international-institutions models
- lawzero — Bengio’s safe-by-design org
- New concept pages: asi-safety-strategies, mutual-assured-ai-malfunction, pivotal-act