Summary: 80,000 Hours Podcast — Nick Joseph on Anthropic’s Safety Approach
Overview
In episode #197 of the 80,000 Hours Podcast, Nick Joseph, head of training at Anthropic and a co-founder who previously worked at OpenAI, explains Anthropic’s Responsible Scaling Policy (RSP), a framework for evaluating and mitigating risks from increasingly capable AI models. The RSP represents one of the most concrete institutional safety frameworks in the industry, defining specific safety levels, evaluations, and required precautions before models can be trained or deployed.
The Responsible Scaling Policy
The RSP is structured around three interlocking components:
1. Safety Levels
Anthropic defines a hierarchy of safety levels corresponding to different degrees of model risk. Each level specifies what a model at that risk tier could potentially do and what safeguards are required. This graduated approach allows safety requirements to scale with capability — early models face lighter requirements, while more capable models trigger increasingly stringent precautions.
2. Capability Evaluations
Before training or deploying a model, Anthropic runs evaluations designed to assess whether the model has reached dangerous capability thresholds. These evaluations are not just performance benchmarks; they specifically test for capabilities that could enable catastrophic harm.
Joseph distinguishes between two types of capability thresholds, illustrated in the sketch after this list:
- Red-line capabilities — The actually dangerous capabilities. Anthropic will not train a model that has these capabilities until the next set of precautions is in place.
- Yellow-line evaluations — Early warning tests that flag when a model is approaching (but has not yet reached) dangerous capability levels, giving the team time to prepare mitigations.
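As a rough illustration, here is how a yellow-line/red-line check might be structured in code. Everything in this sketch is hypothetical: the evaluation name, the 0–100 scoring scale, and the threshold values are invented for illustration and do not reflect Anthropic’s actual evaluation suite.

```python
# Minimal sketch of yellow-line vs. red-line capability checks.
# All names, scales, and thresholds are hypothetical illustrations.
from dataclasses import dataclass
from enum import Enum


class ThresholdStatus(Enum):
    CLEAR = "clear"    # well below any warning threshold
    YELLOW = "yellow"  # approaching dangerous capability: prepare mitigations
    RED = "red"        # at the red line: do not proceed without precautions


@dataclass
class CapabilityEval:
    name: str
    yellow_line: float  # early-warning score
    red_line: float     # actually-dangerous score

    def check(self, score: float) -> ThresholdStatus:
        if score >= self.red_line:
            return ThresholdStatus.RED
        if score >= self.yellow_line:
            return ThresholdStatus.YELLOW
        return ThresholdStatus.CLEAR


# Hypothetical eval scored 0-100: crossing 40 is an early warning,
# crossing 70 would mean the model has the dangerous capability itself.
bio_eval = CapabilityEval(name="bio_uplift", yellow_line=40.0, red_line=70.0)
print(bio_eval.check(55.0))  # ThresholdStatus.YELLOW
```

The point of the yellow line is lead time: because it triggers well before the red line, it buys the team time to build the next set of precautions before they are strictly required.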
3. Required Precautions
For each safety level, the RSP specifies the precautions that must be implemented before a model at that level can be trained or deployed. This creates a conditional relationship between capability and safety: scaling up requires that the corresponding safety measures be demonstrably in place.
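The gating logic itself can be sketched in a few lines. The level names below loosely follow the RSP’s AI Safety Level (ASL) terminology, but the specific precautions are hypothetical placeholders rather than the policy’s actual requirements.

```python
# Sketch of if-then gating: a model at a given safety level may proceed
# only if every precaution required at that level is in place.
# Level names echo the RSP's ASL terminology; the precaution sets are
# hypothetical placeholders, not the policy's actual requirements.
REQUIRED_PRECAUTIONS: dict[str, set[str]] = {
    "ASL-2": {"weights_access_controls"},
    "ASL-3": {"weights_access_controls", "hardened_security",
              "deployment_red_teaming"},
}


def may_proceed(safety_level: str, implemented: set[str]) -> bool:
    """True only if every precaution for this safety level is implemented."""
    return REQUIRED_PRECAUTIONS[safety_level] <= implemented


print(may_proceed("ASL-3", {"weights_access_controls"}))  # False: measures missing
```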
Chaining and Consistency
An important technical insight from the episode is the distinction between a model having a dangerous capability in isolation versus being able to chain capabilities into sustained, long-running threats. Joseph emphasizes that for an AI system to pose catastrophic risk, it typically needs to succeed consistently across many steps. A model that can occasionally produce dangerous outputs is far less concerning than one that can reliably execute multi-step plans. This connects directly to Buck Shlegeris’s work on AI control and the compound-probability defense against multi-step attacks.
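A back-of-the-envelope calculation makes the point concrete. If, purely for illustration, each step of a long-horizon plan succeeds independently with probability p, then completing all n steps succeeds with probability p^n, which decays rapidly as plans get longer:

```python
# Compound probability over a multi-step plan: per-step reliability p,
# assumed independent across steps, gives overall success p ** n.
def chain_success(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps


print(chain_success(0.9, 1))   # 0.9    -> a single dangerous output is likely
print(chain_success(0.9, 50))  # ~0.005 -> a sustained 50-step plan rarely completes
```

At 90% per-step reliability, a one-off dangerous output is quite plausible, but a 50-step autonomous plan succeeds only about 0.5% of the time. Rising consistency, not occasional capability, is what moves a model toward catastrophic risk.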
Comparison to Other Frameworks
Joseph places Anthropic’s RSP in context alongside:
- OpenAI’s Preparedness Framework — A similar evaluation-based approach to managing risk from frontier models.
- DeepMind’s Frontier Safety Framework — Google DeepMind’s version of graduated safety requirements.
While all three frameworks share the core idea of tying safety requirements to capability levels, the RSP was among the first to be publicly articulated with specific, auditable commitments.
The Anthropic Safety Culture
Beyond the formal policy, Joseph discusses Anthropic’s organizational culture around safety. As a company that was founded explicitly to do AI safety research while also building competitive models, Anthropic faces the fundamental tension between commercial pressure and safety commitments. Joseph’s perspective — as someone who left OpenAI to help build a more safety-focused company — provides insight into how institutional structures can either support or undermine safety work.
Significance
The Responsible Scaling Policy has been influential beyond Anthropic, helping to establish the norm that AI companies should have explicit, public frameworks tying safety requirements to capability levels. The concept of “if-then” commitments (if the model reaches capability X, then precaution Y is required) provides a concrete alternative to vaguer promises about “taking safety seriously.”
The episode is also valuable as a window into the practical challenges of implementing safety at a frontier AI lab — balancing the desire for rigor with the need for organizational flexibility as capabilities and risks evolve rapidly.