Responsible Scaling Policy
Definition
A Responsible Scaling Policy (RSP) is a frontier-AI lab’s public, written commitment that ties training and deployment decisions to specific capability-evaluation thresholds: if the model reaches capability X, then the lab will not proceed without mitigation Y in place. The format was introduced by Anthropic (Responsible Scaling Policy, current version PDF) and adopted in parallel by OpenAI’s Preparedness Framework and DeepMind’s Frontier Safety Framework.
An RSP has three interlocking components (Anthropic RSP; 80,000 Hours: Nick Joseph), sketched in code after the list:
- AI Safety Levels (ASLs) — a graduated hierarchy of risk tiers, each with a description of the dangerous capabilities expected at that level.
- Capability evaluations — concrete tests that determine whether a model has crossed a tier threshold.
- Required precautions — security, deployment, and governance mitigations that must be in place before training or deploying a model at a given level.
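One way to see how the three components interlock is a minimal data-structure sketch. This is a hypothetical illustration: the tier description, the `cbrn_uplift_eval` name, and the listed precautions are invented placeholders, not Anthropic’s actual thresholds, eval suite, or required safeguards.

```python
# Illustrative data model of the three RSP components. All names and values
# below are hypothetical placeholders, not any lab's actual policy content.
from dataclasses import dataclass
from typing import Callable

@dataclass
class CapabilityEval:
    name: str
    crossed: Callable[[object], bool]  # returns True if the model passes the threshold

@dataclass
class SafetyLevel:
    asl: int                               # AI Safety Level tier
    dangerous_capabilities: str            # capabilities expected at this tier
    threshold_evals: list[CapabilityEval]  # tests that detect entry into the tier
    required_precautions: list[str]        # mitigations required before training/deploying here

ASL_3 = SafetyLevel(
    asl=3,
    dangerous_capabilities="meaningful uplift on catastrophic-misuse tasks",
    threshold_evals=[CapabilityEval("cbrn_uplift_eval", crossed=lambda model: False)],
    required_precautions=["hardened model-weight security", "deployment safeguards"],
)
```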
Why it matters
RSPs are the dominant institutional safety pattern at frontier AI labs as of 2026. They convert “we take safety seriously” from a cultural aspiration into a procedurally auditable commitment (Anthropic RSP; Atlas Ch.3 — Socio-Technical Strategies).
Three structural reasons they are load-bearing for current safety governance:
- They make safety commitments falsifiable. Without a written RSP, “we’ll deploy responsibly” cannot be checked. With one, external evaluators (METR, UK AISI, US AISI) can in principle verify whether stated thresholds have been crossed and whether stated precautions are in place (Atlas Ch.4 — Governance Architectures; see atlas-ch4-governance-04-governance-architectures).
- They create a coordination focal point. Once one lab publishes an RSP, others face pressure to publish comparable commitments. The convergence of Anthropic’s RSP, OpenAI’s Preparedness Framework, and DeepMind’s Frontier Safety Framework reflects this dynamic (80,000 Hours: Nick Joseph).
- They connect technical safety to corporate decision-making. RSPs link the capability-evaluations research agenda directly to the lab’s training and deployment pipeline. The eval is not just academic — it gates the next training run.
Key results
- The if-then commitment structure (Anthropic RSP). The RSP’s central innovation is the conditional structure: capability evaluation result → required mitigation. This is the if-then-commitments pattern generalized; it transforms vague safety pledges into concrete trip-wires (see the gate sketch after this list).
- Red-line vs. yellow-line evaluations (Anthropic RSP; 80,000 Hours: Nick Joseph). Yellow-line evals are early-warning thresholds that trigger preparation; red-line evals are trip-wires that the lab commits not to cross without specific precautions in place. The two-tier structure provides lead time to develop mitigations.
- Frontier-lab convergence on the RSP pattern. OpenAI’s Preparedness Framework shares the eval-thresholds-and-precautions structure with explicit “Critical” and “High” risk categories (OpenAI PF). DeepMind’s Frontier Safety Framework adopts a similar tiered structure with critical capability thresholds (DeepMind FSF). The pattern is now the de facto industry baseline.
- Capability chaining is what matters, not isolated capability. A model that can do X once is far less concerning than one that can reliably chain capabilities across multi-step plans. Anthropic’s RSP and the 80k Joseph interview emphasize evaluating sustained capability rather than point capability — connecting directly to ai-control’s compound-probability defense (80,000 Hours: Nick Joseph; see the worked example after this list).
- RSPs are the bridge between technical alignment and AI governance. Atlas Ch.3 — Socio-Technical Strategies treats RSPs as the central instance of the lab-level governance pattern complementing alignment research. They are also the model that international AI-governance proposals ([international-ai-safety-report]) most often invoke.
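The if-then conditional and the two-tier thresholds above can be made concrete with a small sketch of a training/deployment gate. Everything here is hypothetical: the `YELLOW_LINE` and `RED_LINE` values, the score semantics, and the function names are invented for illustration and do not correspond to any lab’s published thresholds.

```python
# Hypothetical if-then gate with yellow/red lines. Values are illustrative only.
from enum import Enum

class GateDecision(Enum):
    PROCEED = "proceed"
    PREPARE_MITIGATIONS = "yellow line crossed: start building mitigations"
    HALT = "red line crossed: halt until required precautions are verified"

YELLOW_LINE = 0.2  # early-warning eval score (invented value)
RED_LINE = 0.5     # trip-wire eval score (invented value)

def deployment_gate(eval_score: float, precautions_in_place: bool) -> GateDecision:
    """If the capability eval crosses a threshold, then the mitigation is required."""
    if eval_score >= RED_LINE and not precautions_in_place:
        return GateDecision.HALT
    if eval_score >= YELLOW_LINE:
        return GateDecision.PREPARE_MITIGATIONS
    return GateDecision.PROCEED

assert deployment_gate(0.1, precautions_in_place=False) is GateDecision.PROCEED
assert deployment_gate(0.3, precautions_in_place=False) is GateDecision.PREPARE_MITIGATIONS
assert deployment_gate(0.6, precautions_in_place=False) is GateDecision.HALT
```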
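The chaining point also has a simple quantitative core: if each step of an n-step plan succeeds with probability p, the whole chain succeeds with probability p^n (assuming independence across steps, which is a simplification). A model with 90% per-step reliability completes a 10-step plan only about 35% of the time; at 99% per step, that rises to about 90%. A one-function illustration:

```python
# Compound-probability illustration: sustained chaining, not point capability,
# is what an eval needs to measure. Assumes steps succeed independently.
def chain_success(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps

print(f"90% per step, 10 steps: {chain_success(0.90, 10):.2f}")  # ~0.35
print(f"99% per step, 10 steps: {chain_success(0.99, 10):.2f}")  # ~0.90
```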
Open questions
- Are current RSPs tight enough? Independent analyses (Existing Safety Frameworks Imply Unreasonable Confidence; ailabwatch.org) argue that current RSPs as written do not actually constrain dangerous deployment — the thresholds are too vague, the precautions too easy to claim. Whether RSPs can be tightened sufficiently or whether they are inherently performative is contested.
- What happens when the lab disagrees with the eval? RSPs commit the lab to act on eval results, but the lab also runs the evals. The conflict-of-interest problem is real and only partially mitigated by external audits (Atlas Ch.4 — Governance Architectures; see atlas-ch4-governance-04-governance-architectures).
- How do RSPs handle race dynamics? A unilateral commitment to pause training at threshold X is only effective if competitors also commit. Whether the RSP framework has any answer beyond “voluntary norms” is an open governance question (80,000 Hours: Nick Joseph).
- Are RSP thresholds keeping up with capabilities? Frontier models have been racing past prior thresholds; the question is whether the RSP review-and-update cadence is fast enough to remain meaningful. Anthropic publishes revisions, but the frontier moves fast (Anthropic RSP changelog).
- Should RSPs be voluntary or mandatory? The current pattern is voluntary public commitments. Mandatory regulatory equivalents (EU AI Act, US executive orders) are emerging but not yet aligned with the RSP structure (Atlas Ch.4).
Related agendas
- capability-evals — the research agenda producing the eval methodologies RSPs depend on.
- ai-deception-evals, ai-scheming-evals, autonomy-evals, wmd-evals-weapons-of-mass-destruction — the specific eval families that populate RSP thresholds.
- various-redteams — adversarial probing that complements automated evals.
- control — the operational layer that makes RSP commitments meaningful when alignment cannot be guaranteed.
- safeguards-inference-time-auxiliaries — runtime mitigations that count as RSP “required precautions” at higher levels.
Related concepts
- capability-evaluations — the trigger mechanism inside an RSP.
- frontier-safety-frameworks — the broader category Anthropic’s RSP, OpenAI’s PF, and DeepMind’s FSF share.
- ai-safety-levels — the threshold hierarchy RSPs use.
- if-then-commitments — the commitment pattern generalized.
- ai-governance — RSPs are the dominant lab-level governance instrument.
- ai-risk-management — the broader risk-management discipline RSPs instantiate.
- ai-control — RSPs increasingly require control protocols at higher safety levels.
- scaling-laws — the underlying capability dynamic RSPs are designed to track.
- ai-safety-culture — RSPs embed safety expectations into corporate process.
- socio-technical-strategies — RSPs are the canonical socio-technical safety strategy.
Related pages
- capability-evaluations
- frontier-safety-frameworks
- ai-safety-levels
- if-then-commitments
- ai-governance
- ai-risk-management
- ai-control
- scaling-laws
- ai-safety-culture
- socio-technical-strategies
- ai-safety
- ai-alignment
- ai-takeover-scenarios
- information-security
- anthropic
- openai
- deepmind
- metr
- ai-safety-institute
- capability-evals
- ai-deception-evals
- ai-scheming-evals
- autonomy-evals
- wmd-evals-weapons-of-mass-destruction
- various-redteams
- control
- safeguards-inference-time-auxiliaries
- nick-joseph
- holden-karnofsky
- 80k-podcast-nick-joseph-anthropic-safety
- 80k-podcast-holden-karnofsky-concrete-safety
- atlas-ch3-strategies-06-socio-technical-strategies
- atlas-ch3-strategies-07-combining-strategies
- atlas-ch4-governance-04-governance-architectures
- ai-safety-atlas-textbook
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- AI Safety Atlas Ch.3 — Combining Strategies — referenced as [[atlas-ch3-strategies-07-combining-strategies]]
- AI Safety Atlas Ch.3 — Socio-Technical Strategies — referenced as [[atlas-ch3-strategies-06-socio-technical-strategies]]
- AI Safety Atlas Ch.4 — Governance Architectures — referenced as [[atlas-ch4-governance-04-governance-architectures]]
- Summary: 80,000 Hours Podcast — Holden Karnofsky on Concrete AI Safety at Frontier Companies — referenced as [[80k-podcast-holden-karnofsky-concrete-safety]]
- Summary: 80,000 Hours Podcast — Nick Joseph on Anthropic’s Safety Approach — referenced as [[80k-podcast-nick-joseph-anthropic-safety]]