Nick Joseph

Nick Joseph is the head of training at Anthropic and a co-founder of the company. He previously worked at OpenAI before leaving to help build a more safety-focused AI lab. Joseph is one of the key architects of Anthropic's Responsible Scaling Policy (RSP), one of the most concrete institutional safety frameworks in the AI industry.

The Responsible Scaling Policy

Joseph designed the RSP around three interlocking components:

  1. Safety Levels — A hierarchy of levels corresponding to different degrees of model risk, where each level specifies the safeguards required. Early models face lighter requirements, while more capable models trigger increasingly stringent precautions.
  2. Capability Evaluations — Evaluations run before training or deployment to assess whether a model has reached dangerous capability thresholds. Joseph distinguishes between red-line capabilities (actually dangerous; training halts until precautions are in place) and yellow-line evaluations (early warnings that flag approaching danger).
  3. Required Precautions — For each safety level, the RSP specifies precautions that must be implemented before a model at that level can be trained or deployed, creating a contractual relationship between capability and safety.
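The three components above compose into an "if-then" gate: a model may be trained or deployed only if the precautions required by every safety level it triggers are already in place. A minimal sketch of that logic, with illustrative level and precaution names that are assumptions, not the RSP's actual taxonomy:

```python
# Hypothetical sketch of the RSP's "if-then" gating structure.
# Level names, capability names, and precaution names are illustrative.
from dataclasses import dataclass

@dataclass
class SafetyLevel:
    name: str
    triggering_capabilities: set   # capabilities that place a model at this level
    required_precautions: set      # precautions that must exist before proceeding

def may_proceed(model_capabilities, implemented_precautions, levels):
    """Return True only if, for every safety level the model triggers,
    all of that level's required precautions are already implemented."""
    for level in levels:
        if level.triggering_capabilities & model_capabilities:
            if not level.required_precautions <= implemented_precautions:
                return False  # a triggered level lacks its precautions: halt
    return True

# Illustrative usage: a model that trips a higher-level capability threshold
# is blocked until the matching precautions are in place.
levels = [
    SafetyLevel("level-2", {"basic-autonomy"}, {"standard-deployment-review"}),
    SafetyLevel("level-3", {"serious-uplift"}, {"enhanced-security", "deployment-safeguards"}),
]
blocked = may_proceed({"serious-uplift"}, {"standard-deployment-review"}, levels)
allowed = may_proceed({"serious-uplift"},
                      {"enhanced-security", "deployment-safeguards"}, levels)
```

The point of the structure is that capability and precaution are checked jointly: adding capability without the paired precaution flips the gate to "halt", mirroring the contractual relationship the RSP establishes.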

Technical Insights

A key contribution from Joseph is the distinction between a model having a dangerous capability in isolation versus being able to chain capabilities into sustained, long-running threats. He emphasizes that for an AI system to pose catastrophic risk, it typically needs to succeed consistently across many steps — connecting directly to Buck Shlegeris's compound-probability defense in the AI-control paradigm.
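The compound-probability point can be made concrete with simple arithmetic: if a threat requires n consecutive successful steps and each step succeeds independently with probability p, the overall success probability is p^n, which decays quickly even for high per-step reliability. A minimal illustration (the step counts and probabilities are illustrative, not figures from Joseph):

```python
# Illustrative arithmetic for the compound-probability defense: a long-running
# threat succeeds only if every step in the chain succeeds.
def chain_success_probability(per_step_p: float, n_steps: int) -> float:
    """Probability that n independent steps, each succeeding with
    probability per_step_p, all succeed."""
    return per_step_p ** n_steps

# Even 90% per-step reliability collapses over a 50-step chain:
p_short = chain_success_probability(0.9, 5)    # ≈ 0.59
p_long = chain_success_probability(0.9, 50)    # ≈ 0.005
```

This is why sustained, multi-step threats demand far more consistency from a model than any single dangerous capability does.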

Industry Context

Joseph places Anthropic's RSP alongside OpenAI's Preparedness Framework and DeepMind's Frontier Safety Framework. While all three share the core idea of tying safety requirements to capability levels, the RSP was among the first to be publicly articulated with specific, auditable commitments. The concept of "if-then" commitments — if the model reaches capability X, then precaution Y is required — has been influential beyond Anthropic, helping to establish industry norms.