Summary: Situational Awareness Ch. I — From GPT-4 to AGI: Counting the OOMs
This is Chapter I of Situational Awareness by Leopold Aschenbrenner. It makes the case that AGI by 2027 is “strikingly plausible” by decomposing AI progress into three measurable drivers — compute scaling, algorithmic efficiency, and “unhobbling” — and extrapolating each forward from known trendlines. The source is available at situational-awareness.ai/from-gpt-4-to-agi.
The OOM Framework
OOM stands for “order of magnitude” (10x = 1 OOM, 3x = 0.5 OOM). Aschenbrenner’s central method is to measure AI progress in OOMs of effective compute — the total computational work that contributes to model capability, including both raw compute and the multiplier effects of better algorithms and training techniques. This allows him to treat progress as a quantifiable trend rather than a series of unpredictable breakthroughs.
The key claim: deep learning “just works” — consistent, predictable improvements per OOM of effective compute, enabling reliable extrapolation.
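The OOM bookkeeping used throughout the chapter is just base-10 log arithmetic: multipliers convert to OOMs via log10, and OOMs add where multipliers compound. A minimal sketch (the function names are mine, not the essay's):

```python
import math

def ooms(multiplier: float) -> float:
    """Convert a compute multiplier into orders of magnitude (base-10 log)."""
    return math.log10(multiplier)

# 10x = 1 OOM, 3x ~ 0.5 OOM, 100x = 2 OOMs
print(ooms(10))              # 1.0
print(round(ooms(3), 2))     # 0.48, conventionally rounded to 0.5
print(ooms(100))             # 2.0

def effective_ooms(raw_multiplier: float, algo_multiplier: float) -> float:
    """OOMs add where multipliers compound: raw compute x algorithmic gains."""
    return ooms(raw_multiplier * algo_multiplier)

# e.g. 10,000x raw compute with a 100x algorithmic multiplier:
print(effective_ooms(10_000, 100))   # 6.0, i.e. a 1,000,000x effective gain
```

This additivity is why the chapter can tally heterogeneous drivers (clusters, algorithms) on one scale.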
The GPT-2 to GPT-4 Baseline
To calibrate the extrapolation, Aschenbrenner establishes what the last major qualitative jump looked like:
- GPT-2 (2019) performed at roughly a preschooler level — impressive for stringing plausible sentences together (the famous cherry-picked unicorn story), but barely able to count to 5.
- GPT-4 (2023) performed at a smart high-schooler level — acing college exams, writing code and essays, reasoning through difficult math. These were capabilities that most experts had considered “impenetrable walls” just a few years prior.
This preschooler-to-high-schooler jump was achieved through approximately 4.5-6 OOMs of base effective compute plus a major unhobbling gain (the transition from base model to chatbot via RLHF/instruction tuning).
The Three Drivers of Progress
1. Compute Scaling
This is not driven by Moore’s Law (which delivers only 1-1.5 OOMs per decade). It is driven by investment — building bigger clusters and spending more money on training runs. What once cost millions now costs billions.
- GPT-2 to GPT-4: approximately 3-4 OOMs of raw compute increase (per Epoch AI estimates). GPT-2 to GPT-3 came from datacenter scaleup; GPT-3 to GPT-4 came from building a new, bigger cluster.
- Post-GPT-4 (2023-2027): another ~2-3 OOMs is likely, implying training clusters costing $100B+ at the high end (rumored Microsoft/OpenAI projects are at this scale).
2. Algorithmic Efficiencies
Better algorithms deliver the same performance with less compute, effectively acting as a compute multiplier. The long-term trend is approximately 0.5 OOMs/year of compute efficiency gains.
Evidence:
- ImageNet benchmark: roughly 100x (2 OOMs) less compute needed for the same performance between 2012 and 2021, the equivalent of about 4 years of gains at the 0.5 OOM/year trend.
- API pricing: GPT-4 API costs were similar to GPT-3 at launch, despite vastly superior performance — suggesting approximately half of the GPT-3-to-GPT-4 effective compute gain came from algorithmic improvements.
- Specific innovations: Chinchilla scaling laws, post-Transformer architectural tweaks (RMSNorm, SwiGLU, AdamW) delivered approximately 6x gains at small scale.
GPT-2 to GPT-4: 1-2 OOMs of algorithmic gains. Extrapolating to 2027: approximately +2 OOMs over 4 years.
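The extrapolation is simple trend arithmetic on the ~0.5 OOM/year rate cited above; a sketch, with the helper name and rounding my own:

```python
def algorithmic_ooms(years: float, rate_ooms_per_year: float = 0.5) -> float:
    """Project algorithmic-efficiency gains at the long-run trend rate."""
    return years * rate_ooms_per_year

# 2023 -> 2027 at the historical ~0.5 OOM/year trend:
gain = algorithmic_ooms(4)
print(gain)                  # 2.0 OOMs
print(f"{10 ** gain:.0f}x")  # 100x compute-equivalent multiplier
```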
3. Unhobbling Gains
“Unhobbling” refers to techniques that unlock latent capabilities already present in the base model, often at relatively low additional compute cost. These produce step-changes rather than smooth curves.
Key examples of past unhobbling:
- Chain-of-thought (CoT) prompting: Unlocked mathematical reasoning. GPT-3.5 went from failing to passing the GSM8K math benchmark simply by being prompted to show its work. This mimics human deliberate (System 2) reasoning.
- Agentic scaffolding: GPT-4 scored 2% on SWE-Bench (a software engineering benchmark) as a bare model, but 14-23% when wrapped in the Devin agent framework.
- Tool use: Browser access and code execution (still early stage at time of writing).
- Extended context: a smaller model with a 100k-token context window can outperform a larger model limited to 4k tokens; critical for coding and long-document tasks.
- Combined gains from post-training and tools: GPT-4's score on agentic tasks went from 5% to 40% (per Epoch AI/METR measurements), representing 5-30x effective compute gains.
Projected future unhobbling:
- Test-time compute: Trading inference compute for training-equivalent capability. Aschenbrenner estimates +4 OOMs of test-time compute could yield approximately +3 OOMs of pretrain-equivalent performance — a GPT-3-to-GPT-4 sized jump from inference alone.
- System 2 outer loop: Millions-of-words reasoning streams for long-horizon, multi-step projects.
- Computer use: Multimodal models that control computers the way humans do, becoming true AI agents.
The Projection
Combining the three drivers:
| Period | Base Compute (OOMs) | Algorithmic Efficiency (OOMs) | Unhobbling | Total Effective Compute |
|---|---|---|---|---|
| GPT-2 to GPT-4 | 3-4 OOMs | 1-2 OOMs | Chatbot (major) | ~4.5-6 OOMs + chatbot |
| 2023-2027 | 2-3 OOMs | ~2 OOMs | Agent/drop-in worker | ~3-6 OOMs (~5 median) + agent |
The median projection of approximately 5 OOMs of effective compute gain from 2023 to 2027 matches the GPT-2-to-GPT-4 jump that took AI from preschooler to high-schooler. Applied to GPT-4’s high-schooler baseline, a comparable jump points toward systems that match or exceed expert human performance across most cognitive tasks — i.e., AGI.
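The table's totals are component-wise sums of OOM ranges, with unhobbling kept out of the count because it is a step-change rather than a multiplier. An illustrative sketch; the range endpoints are my reading of the table (the algorithmic range is widened around the ~2 OOM trend to reproduce the stated 3-6 total), and the ~5 OOM median reflects the author's weighting, not the arithmetic midpoint:

```python
# Component OOM ranges for 2023-2027 (illustrative assumptions, see lead-in).
base = (2.0, 3.0)   # cluster scaleup
algo = (1.0, 3.0)   # ~2 OOM trend, widened for uncertainty

# OOM ranges add because the underlying multipliers compound.
total = (base[0] + algo[0], base[1] + algo[1])
print(f"{total[0]:.0f}-{total[1]:.0f} OOMs of effective compute")  # 3-6 OOMs
```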
Key Insight: Uncertainty Is in OOMs, Not Years
A critical framing: the uncertainty about when AGI arrives is measured in orders of magnitude of compute, not in calendar years. Because effective compute is growing at approximately 5 OOMs per 4 years (versus Moore’s Law pace of 1-1.5 OOMs per decade), even large uncertainties in the compute required for AGI translate to relatively small uncertainties in timing. If AGI requires 2 more OOMs than expected, that is less than 2 additional years at current pace — not the decades that Moore’s Law alone would imply.
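The OOMs-to-years conversion in this framing is a one-line division; a sketch using the pace figures quoted above (helper name is mine):

```python
def years_for_extra_ooms(extra_ooms: float, ooms_per_year: float) -> float:
    """Delay implied by needing extra OOMs of effective compute at a given pace."""
    return extra_ooms / ooms_per_year

current_pace = 5 / 4      # ~5 OOMs per 4 years of effective compute growth
moore_pace = 1.25 / 10    # ~1-1.5 OOMs per decade from hardware alone

# A 2-OOM miss in the AGI compute requirement:
print(years_for_extra_ooms(2, current_pace))  # 1.6 years at the current pace
print(years_for_extra_ooms(2, moore_pace))    # 16.0 years at Moore's Law alone
```

The order-of-magnitude gap between the two paces is the whole point: at ~1.25 OOMs/year, even large errors in the compute requirement shift timelines by only a year or two.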
Significance for This Wiki
This chapter provides the quantitative foundation for the transformative-ai thesis that both Situational Awareness and AI 2027 build upon. The OOM-counting methodology makes the “AGI by 2027” claim legible and debatable — it is not a vague prediction but a specific extrapolation from measured trendlines with identified sources of uncertainty.