Anthropic
Anthropic is an AI safety company founded by former OpenAI researchers, built on the premise that it is possible to develop competitive frontier AI models while maintaining strong safety commitments. The company occupies a distinctive position in the AI landscape: it is simultaneously a commercial AI lab building powerful systems and a safety-focused research organization.
Structure: public-benefit corporation.
Safety Teams (per Shallow Review 2025)
Anthropic’s safety organization is unusually large and organized into named teams, most of them led by a senior researcher:
- Scalable Alignment (Jan Leike)
- Alignment Evals (Sam Bowman)
- Interpretability (Chris Olah)
- Control (Ethan Perez)
- Model Psychiatry (Jack Lindsey)
- Character (Amanda Askell)
- Alignment Stress-Testing (Evan Hubinger)
- Alignment Mitigations (Sara Price)
- Frontier Red Team (Logan Graham)
- Safeguards
- Societal Impacts (Deep Ganguli)
- Trust and Safety (Sanderford)
- Model Welfare (Kyle Fish)
Risk Management Framework: Responsible Scaling Policy
Anthropic’s most significant institutional contribution to ai-safety is the responsible-scaling-policy (RSP), a framework for evaluating and mitigating risks from increasingly capable AI models. The RSP has three interlocking components:
- Safety Levels — A hierarchy of risk tiers (which Anthropic calls AI Safety Levels, or ASLs, loosely modeled on biosafety levels) with corresponding safeguard requirements. Requirements scale with capability: less capable models face lighter requirements, while more capable models trigger increasingly stringent precautions.
- capability-evaluations — Tests run during training and before deployment to assess whether a model has crossed dangerous capability thresholds. These include both “red-line” capabilities (those that would themselves be dangerous) and “yellow-line” evaluations (early warnings triggered before a red line is reached).
- Required Precautions — Specific safeguards mandated at each safety level, creating an explicit if-then commitment linking capability to safety.
The RSP was among the first public, auditable safety frameworks from a frontier AI company and has influenced similar frameworks at other labs, such as OpenAI’s Preparedness Framework and Google DeepMind’s Frontier Safety Framework. Its if-then structure is sketched schematically below.
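As a toy illustration only (this is not Anthropic’s actual policy; the level names, evaluation IDs, and safeguard names are hypothetical placeholders), the sketch below shows the general shape of the commitment: crossed capability thresholds determine a required safety level, and deployment is gated on that level’s safeguards being in place.

```python
from dataclasses import dataclass

@dataclass
class SafetyLevel:
    """One tier in an RSP-style hierarchy: evaluation thresholds plus required safeguards."""
    name: str
    capability_thresholds: set  # evaluation IDs that, if crossed, place a model at this level
    required_safeguards: set    # safeguards that must be in place before deploying at this level

# Hypothetical tiers, evaluation IDs, and safeguard names (loosely modeled on the ASL scheme).
LEVELS = [
    SafetyLevel("ASL-2", {"basic_misuse_uplift"}, {"model_card", "abuse_monitoring"}),
    SafetyLevel("ASL-3", {"cbrn_uplift", "autonomous_replication"},
                {"weights_security_hardening", "deployment_classifiers", "red_team_signoff"}),
]

def required_level(crossed_evals: set) -> SafetyLevel:
    """Return the highest tier whose capability thresholds were crossed."""
    level = LEVELS[0]
    for candidate in LEVELS:
        if candidate.capability_thresholds & crossed_evals:
            level = candidate
    return level

def may_deploy(crossed_evals: set, safeguards_in_place: set) -> bool:
    """The if-then gate: deployment is allowed only if every safeguard
    required at the triggered level is already in place."""
    level = required_level(crossed_evals)
    missing = level.required_safeguards - safeguards_in_place
    if missing:
        print(f"Blocked at {level.name}: missing safeguards {sorted(missing)}")
        return False
    return True

# Example: an eval run crosses a hypothetical CBRN threshold, but ASL-3 safeguards are not yet in place.
may_deploy({"cbrn_uplift"}, {"model_card", "abuse_monitoring"})
```

The point of the schematic is the conditional structure itself: evaluations trigger levels, and levels impose obligations that must be satisfied before further scaling or deployment.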
Public Alignment Agenda
Anthropic publishes its safety thinking openly:
- Recommended directions for technical AI safety research (2025)
- Putting up Bumpers (the bumpers framing)
- Safety case checklist (Bowman)
- Core Views on AI Safety — older but foundational
Key People
Chris Olah, Evan Hubinger, Sam Marks, Johannes Treutlein, Sam Bowman, Euan Ong, Fabien Roger, Adam Jermyn, Holden Karnofsky, Jan Leike, Ethan Perez, Jack Lindsey, Amanda Askell, Kyle Fish, Sara Price, Jon Kutasov, Minae Kwon, Monty Evans, Richard Dargan, Roger Grosse, Ben Levinstein, Joseph Carlsmith, Joe Benton, Nick Joseph (head of training).
Funding
Investors include: Amazon, Google, ICONIQ, Fidelity, Lightspeed, Altimeter, Baillie Gifford, BlackRock, Blackstone, Coatue, D1 Capital Partners, General Atlantic, General Catalyst, GIC, Goldman Sachs, Insight Partners, Jane Street, Ontario Teachers’ Pension Plan, Qatar Investment Authority, TPG, T. Rowe Price, WCM, XN.
Concrete Safety Work
Beyond the policy framework, Anthropic conducts extensive practical safety research:
- Alignment research on “model organisms” — Studying deliberately constructed examples of misalignment (model-organisms-of-misalignment)
- Constitutional Classifiers — Classifier-based safeguards, trained on constitution-generated data, that filter model inputs and outputs to block jailbreaks (Constitutional Classifiers paper)
- Monitoring systems like Clio — Privacy-preserving tools for observing aggregate patterns of model use in deployment (a generic version of this pattern is sketched after this list)
- Security against sabotage and backdoors — Including the 2024 sleeper-agents result (see deceptive-alignment)
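The monitoring pattern behind tools like Clio can be sketched generically: summarize conversations, embed the summaries, cluster them, and have humans review unusual clusters. The toy example below is not Anthropic’s implementation; the embeddings are random placeholders standing in for real summary embeddings.

```python
# Toy sketch of cluster-based deployment monitoring (generic pattern, not Clio itself).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for "embed a privacy-preserving summary of each conversation".
summaries = [f"conversation summary {i}" for i in range(500)]
embeddings = rng.normal(size=(len(summaries), 64))  # placeholder embeddings

# Cluster the embedded summaries into coarse usage categories.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(embeddings)
labels = kmeans.labels_

# Surface the smallest clusters first: rare usage patterns are where a reviewer
# would look for surprising or concerning behavior.
counts = np.bincount(labels, minlength=8)
for cluster_id in np.argsort(counts):
    print(f"cluster {cluster_id}: {counts[cluster_id]} conversations")
```

The design choice worth noting is that review happens at the level of clusters of summaries rather than individual transcripts, which is what makes this kind of monitoring compatible with privacy constraints.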
Notable 2025 Outputs (selection from SR2025)
- Evaluating honesty and lie detection techniques on a diverse suite of dishonest models
- Agentic Misalignment: How LLMs could be insider threats
- Why Do Some Language Models Fake Alignment While Others Don’t?
- Forecasting Rare Language Model Behaviors
- Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise
- On the Biology of a Large Language Model
- Auditing language models for hidden objectives
- Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples
- Circuit Tracing: Revealing Computational Graphs in Language Models
- SHADE-Arena: Evaluating sabotage and monitoring in LLM agents
- Emergent Introspective Awareness in Large Language Models
- Reasoning models don’t always say what they think
- Petri: An open-source auditing tool to accelerate AI safety research
- Three Sketches of ASL-4 Safety Case Components
- Open-sourcing circuit tracing tools
- Natural emergent misalignment from reward hacking
Critiques
External critiques aggregated in SR2025:
- Stein-Perlman’s opinions and profile
- Casper on SAE research
- Carlsmith on labs in general
- Underelicitation in eval reports
- Greenblatt’s critique
- Samin: “Unless its governance changes, Anthropic is untrustworthy”
- The Palantir/AWS defense partnership
- “Existing Safety Frameworks Imply Unreasonable Confidence”
Significance for This Wiki
Anthropic represents the most prominent attempt to build a frontier AI company with safety as a core institutional commitment rather than an afterthought. Its RSP framework and concrete safety techniques demonstrate what it looks like to operationalize ai-safety principles at an organization developing powerful AI systems. The company serves as a key case study in whether commercial incentives and safety commitments can coexist.
Related Pages
- ai-safety
- responsible-scaling-policy
- capability-evaluations
- holden-karnofsky
- jan-leike
- nick-joseph
- ai-control
- interpretability
- deceptive-alignment
- model-organisms-of-misalignment
- 80000-hours
- 80k-podcast-nick-joseph-anthropic-safety
- 80k-podcast-holden-karnofsky-concrete-safety
- concrete-problems-in-ai-safety
- openai
- deepmind
- near-term-harms-vs-x-risk
- character-training-and-persona-steering
- future-of-humanity-institute
- givewell
- llm-introspection-training
- metr
- miri
- nate-soares
- nova-dassarma
- open-philanthropy
- paul-christiano
- redwood-research
- rob-wiblin
- 80k-podcast-nova-dassarma-infosec
- summary-bostrom-ai-policy