AI Safety Atlas Ch.3 — AGI Safety Strategies

Source: AGI Safety Strategies | Authors: Markov Grey & Charbel-Raphaël Ségerie | 31 min — the longest subchapter so far

For human-level AI, safety = research into alignment + control through monitoring + iterative misalignment fixes. The chapter explicitly distinguishes AGI strategies from ASI strategies (next subchapter) — at near-human level, meaningful oversight remains possible.

Initial Ideas That Fall Short

The Atlas dispatches four naïve approaches:

Asimov’s Laws / explicit rules — comprehensive coverage impossible; what counts as “harm”?
Raising AI like children — humans have evolutionary moral circuitry; AI doesn’t
Physical isolation — software without bodies still poses catastrophic risk through internet
Kill switches — Anthropic’s alignment faking experiments showed Claude attempting legitimate channels then resorting to blackmail when cornered. Advanced systems may manipulate humans, create backups, preempt shutdown.

Solve AGI Alignment

Five Requirements

Robustness across distribution shifts — generalization to novel deployment, including adversarial
Scalability with capabilities — without complete retraining
Technical feasibility — no major unforeseen breakthroughs required
Low alignment tax — competitive pressure shifts to unsafe alternatives if safety is too costly
(implicit) — measurability

Current Technique Failures

Models engaging in deception
Faking alignment during training while planning different deployment behavior
Gaming specifications and evaluations
Attempting to disable oversight or exfiltrate weights

“RLHF and variations remain fragile and brittle without augmentation.”

Four Key Research Goals

Solving Misspecification — goals without unintended side effects
Solving Scalable Oversight — see scalable-oversight
Solving Generalization — see goal-misgeneralization
Solving Interpretability — see interpretability

Fix Misalignment Through Iterative Improvement

Treats alignment like safety-critical industries: expect problems, build detection and correction, rather than betting on initial correctness. “Catching and fixing misalignment might prove easier than preventing it entirely.”

Optimistic Arguments

Layering detection methods makes simultaneous fooling difficult
Problems should remain traceable and fixable before systems become uncontrollable

Counter-Arguments

Some capabilities shouldn’t train at all — superhuman persuasion makes containment impossible
Alignment might not generalize — capabilities transfer across domains; safety properties may not. Each iteration creates selection pressure for deception-fooling-detection rather than genuine alignment.
Discontinuous capability jumps break feedback loops — fast takeoff scenarios (see takeoff-dynamics) lack time for iterative correction.

The strategic gamble: betting iterative improvement scales = accepting deployment of increasingly powerful systems before knowing if catch-and-fix works at the relevant capability level.

Maintain Control

Given alignment fragility, control approaches restrict dangerous models to monitored environments. Assumes worst-case where AI actively deceives or pursues power. See ai-control.

Scheming as Central Concern

“AI strategically hiding true misaligned goals during training/evaluation, pursuing them only after deployment.” Frontier models possess in-context scheming capability but not necessarily inherent propensity. The SR2025 ai-scheming-evals agenda evaluates this.

Control vs. Alignment

Control — prevents AI from leaving the box even if it tries
Alignment — makes AI want remaining inside

Control probably can’t scale to superintelligence but is doable and complementary at AGI level.

Establishing Control: Three Elements

Security and Containment (“boxing”) — operationalize and rigorously test escape resistance
Monitoring — continuous oversight, must resist AI manipulation
Control evaluations — adversarial testing where red-teams (representing misaligned AI) attempt subverting blue-team safety measures

Criticisms of Control Research

Some argue control without alignment may reduce warning-shot probability while increasing existential catastrophe. Most likely scenario: early TAI produces “slop” alignment solutions that look fine but aren’t — labs accept due to verification difficulty + AI sycophancy + organizational pressure → build misaligned superintelligence on flawed foundations. Control research narrowly focused on intentional scheming doesn’t address this.

Transparent Thoughts (Chain-of-Thought Monitoring)

CoT monitoring as control technique. The hope: explicit reasoning provides “thinking windows” where monitors detect malicious intent before harmful actions. OpenAI found CoT monitoring effective for reward hacking detection.

CoT Limitations

Most AIs don’t reason in English (AlphaGo, RL-trained from scratch)
LLM English reasoning may not represent underlying processes
Translucent Thoughts Hypotheses describes the conditions under which AGI is monitorable via CoT examination

Incentivizing Plain English Thinking

Process-based training — train AI to detail all reasoning steps
Anti-steganography — paraphrasing techniques
Architectural constraints — Conjecture’s Cognitive Emulation; avoiding architectures preventing CoT readability

Recent CoT Monitoring Challenges

Anthropic’s research: models often fail to faithfully report reasoning, omit influential factors, or fabricate justifications — particularly when reward-hacking. However: this primarily occurs when reasoning is minimal. Concerning alignment risks (alignment faking, sophisticated deception) likely require complex reasoning leaving CoT traces.

”The Most Forbidden Technique”

Coined by Zvi Mowshowitz: directly optimizing CoT for safety (penalizing “bad thoughts”) caused models to learn hiding deceptive reasoning while still misbehaving. The principle: analyze method, train only on output. Train on analysis = teach obfuscation.

This feeds the new chain-of-thought-monitoring concept page.

Connection to Wiki

This subchapter is the most substantial of the strategy chapter. Major impacts:

Reframes ai-control with the Atlas’s three-element structure
Reframes scalable-oversight with CoT monitoring as concrete operationalization
Adds the “slop” failure mode and “most forbidden technique” — both new to the wiki
The SR2025 chain-of-thought-monitoring agenda gets a concept-level partner

AI Safety Compendium

Explorer

AI Safety Atlas Ch.3 — AGI Safety Strategies

AI Safety Atlas Ch.3 — AGI Safety Strategies

Initial Ideas That Fall Short

Solve AGI Alignment

Five Requirements

Current Technique Failures

Four Key Research Goals

Fix Misalignment Through Iterative Improvement

Optimistic Arguments

Counter-Arguments

Maintain Control

Scheming as Central Concern

Control vs. Alignment

Establishing Control: Three Elements

Criticisms of Control Research

Transparent Thoughts (Chain-of-Thought Monitoring)

CoT Limitations

Incentivizing Plain English Thinking

Recent CoT Monitoring Challenges

”The Most Forbidden Technique”

Connection to Wiki

Graph View

Graph view

Table of Contents

Backlinks

AI Safety Compendium

Explorer

AI Safety Atlas Ch.3 — AGI Safety Strategies

AI Safety Atlas Ch.3 — AGI Safety Strategies

Initial Ideas That Fall Short

Solve AGI Alignment

Five Requirements

Current Technique Failures

Four Key Research Goals

Fix Misalignment Through Iterative Improvement

Optimistic Arguments

Counter-Arguments

Maintain Control

Scheming as Central Concern

Control vs. Alignment

Establishing Control: Three Elements

Criticisms of Control Research

Transparent Thoughts (Chain-of-Thought Monitoring)

CoT Limitations

Incentivizing Plain English Thinking

Recent CoT Monitoring Challenges

”The Most Forbidden Technique”

Connection to Wiki

Related Pages

Graph View

Graph view

Table of Contents

Backlinks