ASI Safety Strategies
Once AI vastly exceeds human capabilities (ASI — artificial superintelligence), human oversight becomes fundamentally inadequate as a safety mechanism. ASI safety therefore presents qualitatively different challenges from AGI safety. The AI Safety Atlas (Ch.3.5) identifies four core challenges and four strategic approaches.
Why ASI Safety Is Different
Four challenges that distinguish ASI from AGI:
- Human oversight inadequacy — “we lose our ability to evaluate their reasoning, verify their outputs, or provide meaningful feedback.” Alignment cannot rely on human judgment.
- The one-shot requirement — we may only get one chance before the system becomes too capable to contain. Contested by takeoff-speed assumptions: a gradual takeoff would allow iteration.
- Permanent value preservation — recursive self-improvement may rewrite the system’s core algorithms; alignment must survive those rewrites.
- Civilizational-scale control — ASI’s enormous capability means control must preserve human agency across long-term civilizational trajectories.
Four Strategic Approaches
1. Automate Alignment Research
The OpenAI superalignment plan: delegate alignment research to advanced AI. Three components:
- Train AI using human feedback (RLHF-like methods)
- Develop AI that assists human evaluation of complex tasks (sketched below)
- Build language models that produce human-level alignment research
Differential acceleration — maximize the impact of alignment research while minimizing capability acceleration. Cyborgism complements this by training specialized human operators to guide base language models through prompt engineering.
Critique: the “slop” risk — early transformative AI may produce flawed-but-plausible alignment solutions, labs accept them because of verification difficulty, AI sycophancy, and organizational pressure, and the resulting ASI is built on flawed foundations. (See atlas-ch3-strategies-04-agi-safety-strategies.)
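The second component, AI-assisted evaluation, is the most concrete of the three: an assistant model surfaces critiques so that a human rater can judge outputs they could not fully evaluate alone. The sketch below is a deliberately minimal, hypothetical illustration of such a critique-assisted preference-labeling loop; the canned critic, the simulated rater, and all function names are placeholders, not OpenAI's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class Comparison:
    """One preference label of the kind an RLHF-style reward model is trained on."""
    prompt: str
    answer_a: str
    answer_b: str
    preferred: str  # "a" or "b"

def critique(answer: str) -> str:
    # Placeholder critic: a real system would call a trained assistant model
    # that points out subtle flaws the human rater might otherwise miss.
    return f"Check whether the claim '{answer[:40]}' is actually supported."

def assisted_label(prompt: str, answer_a: str, answer_b: str, rater) -> Comparison:
    """Show both answers plus AI critiques to a rater and record the preference."""
    crit_a, crit_b = critique(answer_a), critique(answer_b)
    choice = rater(prompt, (answer_a, crit_a), (answer_b, crit_b))
    return Comparison(prompt, answer_a, answer_b, preferred=choice)

def toy_rater(prompt, a, b):
    # Stand-in for a human rater: prefers the shorter answer after reading critiques.
    return "a" if len(a[0]) <= len(b[0]) else "b"

label = assisted_label("Summarise the argument.", "Short summary.",
                       "A much longer and more rambling summary.", toy_rater)
print(label.preferred)  # -> "a"
```

The preference labels produced this way would feed a reward model, the hope being that AI assistance keeps human judgment meaningful on tasks humans cannot evaluate unaided.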
2. Safety-by-Design
The premise: deep learning may have “potentially unpatchable failure modes,” so ASI should be built with inherent safety properties through formal methods or architectural constraints.
- Guaranteed Safe AI (GSAI) — formal world models, safety specifications, verification mechanisms (see the sketch below)
- Provably safe systems — mathematical proofs as the cornerstone of safety, potentially built on proof-carrying code rather than end-to-end deep learning
- Learning-theoretic agendas — alternative formal alignment frameworks (the SR2025 agenda)
- Scientist AI — non-agentic models that accelerate science without pursuing goals; Bengio’s LawZero is the canonical institutional bet
These approaches accept higher alignment taxes and require international cooperation. Critics point to the difficulty of creating accurate world models, of formally specifying complex properties like “harm,” and of verifying highly complex systems.
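To make the GSAI decomposition concrete, here is a deliberately toy sketch of its three pieces: a world model that predicts outcomes of a proposed action, a safety specification over outcomes, and a verifier gate that approves the action only if the estimated violation rate stays under a bound. Real GSAI proposals rely on formal world models and machine-checked proofs rather than the Monte Carlo stand-in used here; every name below is hypothetical.

```python
from typing import Callable, Iterable

WorldModel = Callable[[str], Iterable[dict]]   # action -> sampled outcome states
SafetySpec = Callable[[dict], bool]            # outcome state -> acceptable?

def verified_gate(world_model: WorldModel, spec: SafetySpec,
                  action: str, risk_bound: float = 0.01) -> bool:
    """Approve `action` only if the modelled violation rate is within `risk_bound`."""
    outcomes = list(world_model(action))
    violations = sum(1 for state in outcomes if not spec(state))
    return violations / max(len(outcomes), 1) <= risk_bound

def toy_world_model(action: str):
    # Placeholder world model: 100 sampled futures, some harmful for the unsafe action.
    return [{"harm": action == "unsafe_op" and i % 3 == 0} for i in range(100)]

def toy_spec(state: dict) -> bool:
    # Placeholder safety specification: no acceptable future involves harm.
    return not state["harm"]

print(verified_gate(toy_world_model, toy_spec, "safe_op"))    # True
print(verified_gate(toy_world_model, toy_spec, "unsafe_op"))  # False
```

The structural point is that the safety argument lives in the specification and the verifier, not in whatever process proposed the action; that is exactly where the critics' worries about world-model accuracy and specifying properties like “harm” apply.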
3. World Coordination
- Global moratorium — delaying ASI by ≥10 years could reduce catastrophic risk; requires democratic discussion of trade-offs
- Tool AI alternative — focus on specialized non-agentic systems (medical, weather)
- International institutions:
  - CERN-like body — research collaboration on safety challenges
  - MAGIC (Multilateral AGI Consortium) — a centralized institution managing final AGI/ASI development under strict protocols
  - Intelsat model — international treaty-based governance for deployed dual-use technology
Historical precedent: the Nuclear Non-Proliferation Treaty, the Biological Weapons Convention, and the Montreal Protocol all show that international cooperation on catastrophic risks is achievable.
4. Deterrence
Mutual Assured AI Malfunction (MAIM) — deterrence regime where unilateral ASI dominance attempts trigger sabotage by rivals. Resembles nuclear MAD.
Limitations: ASI development lacks clear detection thresholds; training can be distributed or concealed; and nations rarely escalate sufficiently for the deterrent threat to be credible.
Yudkowsky’s extreme position: halt AI research completely, shut down GPU clusters, and limit compute, enforced by military action if necessary.
Conceptual Frameworks
Pivotal Acts
The first aligned ASI performs decisive actions that permanently end the acute risk period, such as disabling global computing infrastructure or establishing unbreakable agreements. Critics argue this militarizes development and contradicts democratic governance.
Pivotal Processes
The alternative: distributed coordination that uses aligned AI to improve human decision-making and governance, preserving human agency throughout.
The Strawberry Problem
“Can we create an AI duplicating a strawberry at the cellular level, placing both on a plate, then stopping completely?” This tests whether ASI alignment requires mathematically precise specification or whether pragmatically robust beneficial goals suffice.
Philosophical Layer
ASI safety strategy cannot avoid philosophical questions covered in atlas-ch3-strategies-09-appendix-long-term-questions:
- What values? — see coherent-extrapolated-volition (CEV/CAV/CBV)
- Whose values? — see alignment-to-whom (single-single/single-multi/multi-single/multi-multi)
- Survival vs. flourishing? — MacAskill’s viatopia framing
- Worthy successor? — see Faggella’s and Sutton’s positions
Connection to Wiki
- superalignment — the canonical “automate alignment research” strategy
- lawzero — Bengio’s safety-by-design organization
- guaranteed-safe-ai, scientist-ai — SR2025 agendas
- mutual-assured-ai-malfunction, pivotal-act — new concept pages
- differential-development — overlapping strategic frame
- ai-governance — the implementation layer
- atlas-ch3-strategies-05-asi-safety-strategies — primary source
Related Pages
- ai-safety-atlas-textbook
- superalignment
- lawzero
- guaranteed-safe-ai
- scientist-ai
- mutual-assured-ai-malfunction
- pivotal-act
- differential-development
- ai-governance
- coherent-extrapolated-volition
- alignment-to-whom
- atlas-ch3-strategies-05-asi-safety-strategies
- atlas-ch3-strategies-09-appendix-long-term-questions
- eliezer-yudkowsky
- yoshua-bengio
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- AI Safety Atlas Ch.3 — AGI Safety Strategies — referenced as [[atlas-ch3-strategies-04-agi-safety-strategies]]
- AI Safety Atlas Ch.3 — Appendix: Long-term Questions — referenced as [[atlas-ch3-strategies-09-appendix-long-term-questions]]
- AI Safety Atlas Ch.3 — ASI Safety Strategies — referenced as [[atlas-ch3-strategies-05-asi-safety-strategies]]