Scaling Laws

Scaling laws are the empirical finding that AI capabilities improve predictably and smoothly as a function of compute, data, and model size. Rather than relying on discrete breakthroughs, deep learning progresses through consistent, measurable gains per order of magnitude (OOM) of effective compute. This regularity is the quantitative foundation for near-term transformative-ai forecasts.

The OOM Framework

leopold-aschenbrenner popularized the use of OOMs (orders of magnitude, where 1 OOM = 10x) as the unit of AI progress in Situational Awareness. His central method tracks progress in terms of effective compute: the total computational work contributing to model capability, combining raw compute with the multiplier effects of better algorithms.

The key claim: deep learning “just works” — consistent, predictable improvements per OOM of effective compute, enabling reliable extrapolation rather than speculative forecasting.
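The OOM bookkeeping above can be made concrete with a minimal sketch. The point is that because effective compute multiplies raw compute by algorithmic gains, the corresponding OOMs simply add (the function names here are illustrative, not from the source):

```python
import math

def ooms(mult: float) -> float:
    """Convert a raw multiplier into orders of magnitude (1 OOM = 10x)."""
    return math.log10(mult)

def multiplier(n_ooms: float) -> float:
    """Convert OOMs back into a raw multiplier."""
    return 10 ** n_ooms

# Effective compute = raw compute x algorithmic multiplier,
# so in log space (OOMs) the contributions add.
raw_compute_ooms = ooms(1000)   # a 1000x bigger training run -> 3 OOMs
algorithmic_ooms = ooms(100)    # a 100x algorithmic efficiency gain -> 2 OOMs
effective_ooms = raw_compute_ooms + algorithmic_ooms   # 5 OOMs total
```

This additivity is what lets the three drivers below be tallied independently and then summed into a single effective-compute trendline.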

Three Drivers of Progress

1. Compute Scaling

Progress comes not from Moore’s Law (which delivers only 1-1.5 OOMs per decade) but from investment: building bigger clusters and spending more money on training runs. GPT-2 to GPT-4 involved approximately 3-4 OOMs of raw compute increase. Post-GPT-4 projections suggest another 2-3 OOMs from clusters costing tens of billions of dollars.

2. Algorithmic Efficiency

Better algorithms deliver the same performance with less compute, acting as a compute multiplier. The long-term trend is approximately 0.5 OOMs/year. Evidence includes 100x less compute needed for ImageNet-equivalent performance between 2012 and 2021, and specific innovations like Chinchilla scaling laws, architectural tweaks (RMSNorm, SwiGLU), and improved optimizers.
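As a worked example of how the ~0.5 OOMs/year trend compounds (a sketch of the arithmetic, not code from the source): at that rate, four years of algorithmic progress alone implies roughly two OOMs, i.e. ~100x less compute for the same performance.

```python
OOMS_PER_YEAR = 0.5  # approximate long-term algorithmic efficiency trend

def efficiency_multiplier(years: float) -> float:
    """Compute-efficiency multiplier implied by the ~0.5 OOM/year trend."""
    return 10 ** (OOMS_PER_YEAR * years)

four_year_gain = efficiency_multiplier(4)   # 2 OOMs -> ~100x
decade_gain = efficiency_multiplier(10)     # 5 OOMs -> ~100,000x
```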

3. Unhobbling Gains

Techniques that unlock latent capabilities already present in the base model, often at low additional compute cost. Key examples include chain-of-thought prompting (unlocking mathematical reasoning), agentic scaffolding (GPT-4 going from 2% to 14-23% on SWE-Bench), tool use, and extended context windows. These produce step-changes rather than smooth curves.

The GPT-2 to GPT-4 Calibration

To ground the extrapolation, Aschenbrenner establishes what the last major qualitative jump looked like in vivid terms:

  • GPT-2 (2019) ~ preschooler: “could barely count to 5 without getting tripped up”; a cherry-picked story about unicorns was “incredibly impressive at the time.”
  • GPT-3 (2020) ~ elementary schooler: Multi-paragraph coherence, basic arithmetic, first commercial utility for simple SEO copy.
  • GPT-4 (2023) ~ smart high-schooler: Writes sophisticated code, reasons through competition math, “scores better than the vast majority of high schoolers” on AP exams.

This jump required approximately 4.5-6 OOMs of effective compute plus a major unhobbling gain (the transition from base model to chatbot via rlhf).

Projecting forward, another ~100,000x (5 OOMs) of effective compute gain by 2027 would produce a comparable qualitative jump — from smart high-schooler to expert human performance across most cognitive tasks.
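The projection can be sketched as a per-driver tally. The annual rates below are illustrative assumptions chosen to be roughly consistent with the ranges quoted above; they are not figures from the source text:

```python
# Assumed per-year OOM rates for each driver (illustrative only):
drivers = {
    "raw compute scaleup": 0.7,       # cluster buildout and spending
    "algorithmic efficiency": 0.5,    # the long-term ~0.5 OOM/year trend
}

years = 4  # e.g. GPT-4 (2023) -> 2027
total_ooms = sum(drivers.values()) * years   # ~4.8 OOMs
equivalent_multiplier = 10 ** total_ooms     # on the order of 100,000x
```

Unhobbling gains sit on top of this tally as discrete step-changes rather than a smooth annual rate, which is why they resist this kind of per-year accounting.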

Key Insight: Uncertainty in OOMs, Not Years

A structural insight from Situational Awareness: “our uncertainty over what it takes to get AGI should be over OOMs (of effective compute), rather than over years.” We are racing through ~10 OOMs this decade, far faster than Moore’s Law at 1-1.5 OOMs/decade. This is largely a one-time phenomenon, driven by spending scaleup (toward ~$1T training runs), hardware specialization (CPUs to GPUs to AI-specific chips, fp64 down to fp8), and picking algorithmic low-hanging fruit. After the early 2030s, “we will face a slow slog.” Therefore: “if this scaleup doesn’t get us to AGI in the next 5-10 years, it might be a long way out.”

The Benchmark Destruction Pattern

Aschenbrenner documents a recurring pattern: benchmarks designed to last years are cracked in months. The MATH benchmark (difficult high-school competition problems) went from 5% (GPT-3, 2021) to over 90% in just a few years, despite the original authors writing that “simply increasing budgets and model parameter counts will be impractical.” MMLU, designed to “stand the test of time,” was “basically solved” in three years. “Over and over again, year after year, skeptics have claimed ‘deep learning won’t be able to do X’ and have been quickly proven wrong.” The hardest current benchmarks (like GPQA, PhD-level science questions) are already partially cracked.

Significance

Scaling laws transform the AI timeline question from speculative futurism into quantitative forecasting. They are the foundation for the intelligence-explosion hypothesis, the urgency behind superalignment research, and the trillion-dollar investment thesis driving frontier AI lab strategy. As Aschenbrenner writes: “The magic of deep learning is that it just works — and the trendlines have been astonishingly consistent, despite naysayers at every turn.”

Atlas Decomposition: Effective Compute

The AI Safety Atlas (Ch.1) refines the scaling-laws picture with an explicit multiplicative decomposition of effective-compute:

Effective compute = Software efficiency × Hardware efficiency × Number of chips

As of mid-2025: software efficiency ~3×/year, hardware efficiency ~1.35×/year, chip production ~2.3×/year — compounding to roughly 9× annual effective-compute growth. The Atlas also formalizes the bitter lesson (bitter-lesson) as the deeper principle: general methods leveraging computation consistently beat hand-engineered domain knowledge.
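The Atlas decomposition can be checked directly: multiplying the three mid-2025 growth factors gives the compounded annual rate, and taking the base-10 log converts it into OOMs per year (a minimal verification sketch, assuming the quoted factors):

```python
import math

# Mid-2025 annual growth factors from the Atlas decomposition:
software = 3.0    # software/algorithmic efficiency per year
hardware = 1.35   # hardware efficiency per year
chips = 2.3       # chip production per year

effective_growth = software * hardware * chips      # ~9.3x per year
ooms_per_year = math.log10(effective_growth)        # ~0.97 OOMs/year
ooms_per_decade = 10 * ooms_per_year                # ~9.7 OOMs/decade
```

Note that this annual rate, sustained for a decade, reproduces the ~10 OOMs-per-decade figure from the Situational Awareness framing above.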

The Atlas adds a third scaling hypothesis beyond strong/weak: Scale + Techniques + Tools. Foundation-model scaling laws only capture single-model capability; chain-of-thought, tool use, retrieval, and multi-model scaffolding (Aschenbrenner’s “unhobbling”) may continue advancing real-world capability even if foundation-model scaling plateaus. This recognizes that 2025 systems like agentic LLMs derive much of their capability from external scaffolding around a foundation model.

The Atlas also flags Broken Neural Scaling Laws (BNSL, 2023): performance doesn’t always improve smoothly. Grokking, deep double descent, sharp transitions, and temporary regressions exist even within an overall smooth trend — relevant to capability-evaluations and surprise-emergence concerns.
