AI Safety Atlas Ch.1 — Foundation Models

Source: Foundation Models | ai-safety-atlas.com/chapters/v1/capabilities/foundation-models/ | Authors: Markov Grey & Charbel-Raphaël Ségerie | 8 min

The Paradigm Shift

Foundation models represent a fundamental shift in how AI is developed. Rather than building specialized models for each task, large-scale models are trained as a “foundation” for many applications, then specialized via fine-tuning. The analogy is to a building’s foundation: the same base structure can support banks, restaurants, or housing.

The traditional approach — specialized models with human-labeled data per task — was bottlenecked by data-labeling costs and the inability to transfer knowledge between tasks. Foundation models broke through via self-supervised learning on massive unlabeled datasets, enabled by GPU hardware advances, transformer architectures, and the explosion of online data.

Examples spanning modalities: GPT-4, Claude (language); DALL-E 3, Stable Diffusion (vision); GPT-4V, Gemini (multimodal — LMMs); Gato (multi-task RL).

Foundation Models vs. Frontier Models

A distinction the chapter clarifies: frontier models are the cutting edge of capability in any domain; foundation models are general-purpose substrates. They overlap (Claude 3.5 Sonnet is both), but not always — AlphaFold is a frontier model in protein structure prediction, but it is specialized for one task and is not a foundation model.

Training Pipeline

Foundation models follow a two-stage pattern:

Pre-training

The model learns general patterns from datasets of millions to billions of examples, with no specific task in mind. “This generality is both powerful and concerning from a safety perspective.” It enables broad adaptation but means we can’t easily predict or constrain what is learned.

Self-Supervised Learning (SSL)

SSL is the key technical innovation behind the shift. Instead of relying on human labels, it leverages the data’s inherent structure: hide part of an image and predict the rest; predict the next word in a sentence (“The cat sat on the …”). Through repeated exposure to vast quantities of data, the model learns objects, contexts, grammar, and semantics without any human-provided labels.
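
To make the objective concrete, here is a minimal sketch of next-token self-supervised training in PyTorch. The model, vocabulary size, and data are toy stand-ins assumed for illustration (a real foundation model uses a deep transformer at vastly larger scale); the point is that the targets are carved out of the raw data itself, with no human labels anywhere.

```python
import torch
import torch.nn as nn

# Toy stand-in for a language model: embedding + linear head.
# (A real foundation model would use a deep transformer stack here.)
vocab_size, d_model = 256, 64
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)

tokens = torch.randint(0, vocab_size, (1, 16))   # raw, unlabeled token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # labels come from the data itself

logits = model(inputs)                           # shape: (1, 15, vocab_size)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()  # the only supervision signal is the text's own structure
```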

Adaptation

Two main approaches after pre-training, contrasted in the sketch after this list:

  • Fine-tuning — additional training on a specific task or dataset (e.g., RLHF for instruction-following).
  • Prompting / in-context learning — guiding behavior via crafted inputs without weight updates.
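
A minimal sketch contrasting the two routes, reusing the same kind of toy stand-in model as above (the data, hyperparameters, and task here are illustrative assumptions, not the chapter’s):

```python
import torch
import torch.nn as nn

# Same toy stand-in as in the pre-training sketch.
vocab_size, d_model = 256, 64
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))

# 1) Fine-tuning: further gradient updates on task-specific data.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
task_tokens = torch.randint(0, vocab_size, (1, 16))     # task-specific text ids
inputs, targets = task_tokens[:, :-1], task_tokens[:, 1:]
loss = nn.functional.cross_entropy(
    model(inputs).reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()
optimizer.step()                                        # the weights change

# 2) Prompting / in-context learning: no weight updates at all;
# behavior is steered by what goes into the context window.
demonstrations = torch.randint(0, vocab_size, (1, 12))  # worked examples
query = torch.randint(0, vocab_size, (1, 4))            # the new task input
prompt = torch.cat([demonstrations, query], dim=1)
with torch.no_grad():
    logits = model(prompt)                              # same frozen weights
```

RLHF, mentioned above as a fine-tuning example, rides the first route: its objective differs (a learned reward signal rather than next-token loss), but it is still a weight update, which prompting never performs.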

Safety Challenges in Training

Three structural concerns:

  1. SSL’s nature limits control over what is learned — unintended capabilities and behaviors can emerge.
  2. The adaptation process must reliably preserve safety properties established in pre-training.
  3. Massive scale makes thorough understanding or auditing difficult.

The chapter explicitly connects these training-stage concerns to downstream agendas elsewhere in the textbook: goal misgeneralization (Ch.7) and scalable-oversight (Ch.8) are “deeply connected to how these models are trained and adapted.”

Properties That Matter for Safety

Transfer Learning

Knowledge from pre-training transfers to new tasks. Both useful capabilities and harmful biases transfer — undesired behaviors can show up in unexpected applications.
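
One standard way transfer shows up in practice is feature reuse: freeze the pre-trained weights and train only a small task head on top. A minimal sketch, with a hypothetical stand-in backbone (the chapter does not specify any particular setup):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained backbone.
backbone = nn.Sequential(nn.Linear(128, 128), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad = False        # keep pre-trained knowledge fixed

head = nn.Linear(128, 2)           # new task: a small binary classifier
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

x, y = torch.randn(8, 128), torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(head(backbone(x)), y)
loss.backward()
optimizer.step()
# Whatever the backbone encoded during pre-training -- useful features
# and learned biases alike -- flows into the new task unchanged.
```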

Zero-shot and Few-shot Learning

Foundation models perform new tasks with very few examples or none at all. “If models can adapt to novel situations in unexpected ways, it becomes harder to predict and control their behavior in deployment.”
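
In practice, “a few examples” often means literally a few lines of text in the prompt. A sketch of the pattern (the sentiment task below is invented for illustration, not taken from the chapter):

```python
# Few-shot prompting: the "training data" is just text in the context window.
# The sentiment-classification task here is a hypothetical illustration.
few_shot_prompt = """Classify the sentiment of each review.

Review: "Great battery life." -> positive
Review: "Broke after a week." -> negative
Review: "Exceeded my expectations." ->"""

# A capable foundation model typically completes this with " positive",
# despite never being trained on this exact task or format: adaptation
# with no weight updates, which is part of what makes deployment behavior
# hard to bound in advance.
```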

Cross-domain Generalization

Capabilities generalize across domains, but goals and constraints often fail to generalize alongside them. A model might generalize its text-manipulation skills in surprising ways without preserving its intended safety constraints. This connects directly to the wiki’s goal-misgeneralization page.

Why This Matters for Risk

The chapter’s safety thesis: foundation models introduce a class of risks that didn’t exist before. Specifically:

  • Power centralization — only large actors can train base models.
  • Homogenization — many downstream systems inherit the flaws and blind spots of the same base model.
  • Dual-use capabilities — emergent capabilities expand the attack surface.
  • Misalignment risk moves from theoretical to immediate as capabilities expand.

“Complex capabilities, combined with generality and scale, means we need to seriously consider safety risks beyond just misuse that previously seemed theoretical or distant.”

Connection to Wiki

This subchapter justifies a dedicated foundation-models concept page in the wiki: the concept is referenced widely (in situational-awareness, ai-population-explosion, scaling-laws, etc.) but previously had no standalone page. The training-pipeline framing also clarifies how rlhf, deceptive-alignment, and goal-misgeneralization all attach to the same training pipeline.