Summary: Situational Awareness Ch. I — From GPT-4 to AGI: Counting the OOMs
This is Chapter I of Situational Awareness by Leopold Aschenbrenner. It makes the case that AGI by 2027 is “strikingly plausible” by decomposing AI progress into three measurable drivers — compute scaling, algorithmic efficiency, and “unhobbling” — and extrapolating each forward from known trendlines. The source is available at situational-awareness.ai/from-gpt-4-to-agi.
The OOM Framework
OOM stands for “order of magnitude” (10x = 1 OOM, 3x = 0.5 OOM). Aschenbrenner’s central method is to measure AI progress in OOMs of effective compute — the total computational work that contributes to model capability, including both raw compute and the multiplier effects of better algorithms and training techniques. This allows him to treat progress as a quantifiable trend rather than a series of unpredictable breakthroughs.
The key claim: deep learning “just works” — consistent, predictable improvements per OOM of effective compute, enabling reliable extrapolation.
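The OOM bookkeeping used throughout the chapter is just base-10 log arithmetic: multipliers convert to OOMs via log10, and OOMs add where multipliers compound. A minimal sketch (the function names are mine, not the essay's):

```python
import math

def ooms(multiplier: float) -> float:
    """Convert a compute multiplier into orders of magnitude (base-10 log)."""
    return math.log10(multiplier)

# 10x = 1 OOM, 3x ~ 0.5 OOM, 100x = 2 OOMs
print(ooms(10))              # 1.0
print(round(ooms(3), 2))     # 0.48, conventionally rounded to 0.5
print(ooms(100))             # 2.0

def effective_ooms(raw_multiplier: float, algo_multiplier: float) -> float:
    """OOMs add where multipliers compound: raw compute x algorithmic gains."""
    return ooms(raw_multiplier * algo_multiplier)

# e.g. 10,000x raw compute with a 100x algorithmic multiplier:
print(effective_ooms(10_000, 100))   # 6.0, i.e. a 1,000,000x effective gain
```

This additivity is why the chapter can tally heterogeneous drivers (clusters, algorithms) on one scale.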
The GPT-2 to GPT-4 Baseline
To calibrate the extrapolation, Aschenbrenner establishes what the last major qualitative jump looked like:
- GPT-2 (2019) performed at roughly a preschooler level — impressive for stringing plausible sentences together (the famous cherry-picked unicorn story), but barely able to count to 5.
- GPT-4 (2023) performed at a smart high-schooler level — acing college exams, writing code and essays, reasoning through difficult math. These were capabilities that most experts had considered “impenetrable walls” just a few years prior.
This preschooler-to-high-schooler jump was achieved through approximately 4.5-6 OOMs of base effective compute plus a major unhobbling gain (the transition from base model to chatbot via RLHF/instruction tuning).
The Three Drivers of Progress
1. Compute Scaling
This is not driven by Moore’s Law (which delivers only 1-1.5 OOMs per decade). It is driven by investment — building bigger clusters and spending more money on training runs. What once cost millions now costs billions.
- GPT-2 to GPT-4: approximately 3-4 OOMs of raw compute increase (per Epoch AI estimates). GPT-2 to GPT-3 came from datacenter scaleup; GPT-3 to GPT-4 came from building a new, bigger cluster.
- Post-GPT-4 (2023-2027): another ~2-3 OOMs is likely, implying training clusters costing $100B+ at the high end (rumored Microsoft/OpenAI projects are at this scale).
2. Algorithmic Efficiencies
Better algorithms deliver the same performance with less compute, effectively acting as a compute multiplier. The long-term trend is approximately 0.5 OOMs/year of compute efficiency gains.
Evidence:
- ImageNet benchmark: roughly 100x (2 OOMs) less compute needed for the same performance between 2012 and 2021, the equivalent of about 4 years of gains at the 0.5 OOM/year trend.
- API pricing: GPT-4 API costs were similar to GPT-3 at launch, despite vastly superior performance — suggesting approximately half of the GPT-3-to-GPT-4 effective compute gain came from algorithmic improvements.
- Specific innovations: Chinchilla scaling laws, post-Transformer architectural tweaks (RMSNorm, SwiGLU, AdamW) delivered approximately 6x gains at small scale.
GPT-2 to GPT-4: 1-2 OOMs of algorithmic gains. Extrapolating to 2027: approximately +2 OOMs over 4 years.
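The extrapolation is simple trend arithmetic on the ~0.5 OOM/year rate cited above; a sketch, with the helper name and rounding my own:

```python
def algorithmic_ooms(years: float, rate_ooms_per_year: float = 0.5) -> float:
    """Project algorithmic-efficiency gains at the long-run trend rate."""
    return years * rate_ooms_per_year

# 2023 -> 2027 at the historical ~0.5 OOM/year trend:
gain = algorithmic_ooms(4)
print(gain)                  # 2.0 OOMs
print(f"{10 ** gain:.0f}x")  # 100x compute-equivalent multiplier
```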
3. Unhobbling Gains
“Unhobbling” refers to techniques that unlock latent capabilities already present in the base model, often at relatively low additional compute cost. These produce step-changes rather than smooth curves.
Key examples of past unhobbling:
- Chain-of-thought (CoT) prompting: Unlocked mathematical reasoning. GPT-3.5 went from failing to passing the GSM8K math benchmark simply by being prompted to show its work. This mimics human deliberate (System 2) reasoning.
- Agentic scaffolding: GPT-4 scored 2% on SWE-Bench (a software engineering benchmark) as a bare model, but 14-23% when wrapped in the Devin agent framework.
- Tool use: Browser access and code execution (still early stage at time of writing).
- Extended context: a smaller model with a 100k-token context window can outperform a larger model limited to 4k tokens; critical for coding and long-document tasks.
- Combined gains from post-training and tools: GPT-4's score on agentic tasks went from 5% to 40% (per Epoch AI/METR measurements), representing 5-30x effective compute gains.
Projected future unhobbling:
- Test-time compute: Trading inference compute for training-equivalent capability. Aschenbrenner estimates +4 OOMs of test-time compute could yield approximately +3 OOMs of pretrain-equivalent performance — a GPT-3-to-GPT-4 sized jump from inference alone.
- System 2 outer loop: Millions-of-words reasoning streams for long-horizon, multi-step projects.
- Computer use: Multimodal models that control computers the way humans do, becoming true AI agents.
The Projection
Combining the three drivers:
| Period | Base Compute (OOMs) | Algorithmic Efficiency (OOMs) | Unhobbling | Total Effective Compute |
|---|---|---|---|---|
| GPT-2 to GPT-4 | 3-4 OOMs | 1-2 OOMs | Chatbot (major) | ~4.5-6 OOMs + chatbot |
| 2023-2027 | 2-3 OOMs | ~2 OOMs | Agent/drop-in worker | ~3-6 OOMs (~5 median) + agent |
The median projection of approximately 5 OOMs of effective compute gain from 2023 to 2027 matches the GPT-2-to-GPT-4 jump that took AI from preschooler to high-schooler. Applied to GPT-4’s high-schooler baseline, a comparable jump points toward systems that match or exceed expert human performance across most cognitive tasks — i.e., AGI.
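The table's totals are component-wise sums of OOM ranges, with unhobbling kept out of the count because it is a step-change rather than a multiplier. An illustrative sketch; the range endpoints are my reading of the table (the algorithmic range is widened around the ~2 OOM trend to reproduce the stated 3-6 total), and the ~5 OOM median reflects the author's weighting, not the arithmetic midpoint:

```python
# Component OOM ranges for 2023-2027 (illustrative assumptions, see lead-in).
base = (2.0, 3.0)   # cluster scaleup
algo = (1.0, 3.0)   # ~2 OOM trend, widened for uncertainty

# OOM ranges add because the underlying multipliers compound.
total = (base[0] + algo[0], base[1] + algo[1])
print(f"{total[0]:.0f}-{total[1]:.0f} OOMs of effective compute")  # 3-6 OOMs
```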
Key Insight: Uncertainty Is in OOMs, Not Years
A critical framing: the uncertainty about when AGI arrives is measured in orders of magnitude of compute, not in calendar years. Because effective compute is growing at approximately 5 OOMs per 4 years (versus Moore’s Law pace of 1-1.5 OOMs per decade), even large uncertainties in the compute required for AGI translate to relatively small uncertainties in timing. If AGI requires 2 more OOMs than expected, that is less than 2 additional years at current pace — not the decades that Moore’s Law alone would imply.
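The OOMs-to-years conversion in this framing is a one-line division; a sketch using the pace figures quoted above (helper name is mine):

```python
def years_for_extra_ooms(extra_ooms: float, ooms_per_year: float) -> float:
    """Delay implied by needing extra OOMs of effective compute at a given pace."""
    return extra_ooms / ooms_per_year

current_pace = 5 / 4      # ~5 OOMs per 4 years of effective compute growth
moore_pace = 1.25 / 10    # ~1-1.5 OOMs per decade from hardware alone

# A 2-OOM miss in the AGI compute requirement:
print(years_for_extra_ooms(2, current_pace))  # 1.6 years at the current pace
print(years_for_extra_ooms(2, moore_pace))    # 16.0 years at Moore's Law alone
```

The order-of-magnitude gap between the two paces is the whole point: at ~1.25 OOMs/year, even large errors in the compute requirement shift timelines by only a year or two.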
Significance for This Wiki
This chapter provides the quantitative foundation for the transformative-ai thesis that both Situational Awareness and AI 2027 build upon. The OOM-counting methodology makes the “AGI by 2027” claim legible and debatable — it is not a vague prediction but a specific extrapolation from measured trendlines with identified sources of uncertainty.