Anthropic
Anthropic is an AI safety company founded by former OpenAI researchers, built on the premise that it is possible to develop competitive frontier AI models while maintaining strong safety commitments. The company occupies a distinctive position in the AI landscape: it is simultaneously a commercial AI lab building powerful systems and a safety-focused research organization.
Structure: public-benefit corporation.
Safety Teams (per Shallow Review 2025)
Anthropic’s safety organization is unusually large and organized into named teams, most of them led by a senior researcher:
- Scalable Alignment (Jan Leike)
- Alignment Evals (Sam Bowman)
- Interpretability (Chris Olah)
- Control (Ethan Perez)
- Model Psychiatry (Jack Lindsey)
- Character (Amanda Askell)
- Alignment Stress-Testing (Evan Hubinger)
- Alignment Mitigations (Sara Price)
- Frontier Red Team (Logan Graham)
- Safeguards
- Societal Impacts (Deep Ganguli)
- Trust and Safety (Sanderford)
- Model Welfare (Kyle Fish)
Risk Management Framework: Responsible Scaling Policy
Anthropic’s most significant institutional contribution to ai-safety is the responsible-scaling-policy (RSP), a framework for evaluating and mitigating risks from increasingly capable AI models. The RSP has three interlocking components:
- Safety Levels — A hierarchy of risk tiers (which Anthropic calls AI Safety Levels, or ASLs, loosely modeled on biosafety levels) with corresponding safeguard requirements. Requirements scale with capability: less capable models face lighter requirements, while more capable models trigger increasingly stringent precautions.
- capability-evaluations — Tests run during training and before deployment to assess whether a model has crossed dangerous capability thresholds. These include both “red-line” capabilities (those that would themselves be dangerous) and “yellow-line” evaluations (early warnings triggered before a red line is reached).
- Required Precautions — Specific safeguards mandated at each safety level, creating an explicit if-then commitment linking capability to safety.
The RSP was among the first public, auditable safety frameworks from a frontier AI company and has influenced similar frameworks at other labs, such as OpenAI’s Preparedness Framework and Google DeepMind’s Frontier Safety Framework. Its if-then structure is sketched schematically below.
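As a toy illustration only (this is not Anthropic’s actual policy; the level names, evaluation IDs, and safeguard names are hypothetical placeholders), the sketch below shows the general shape of the commitment: crossed capability thresholds determine a required safety level, and deployment is gated on that level’s safeguards being in place.

```python
from dataclasses import dataclass

@dataclass
class SafetyLevel:
    """One tier in an RSP-style hierarchy: evaluation thresholds plus required safeguards."""
    name: str
    capability_thresholds: set  # evaluation IDs that, if crossed, place a model at this level
    required_safeguards: set    # safeguards that must be in place before deploying at this level

# Hypothetical tiers, evaluation IDs, and safeguard names (loosely modeled on the ASL scheme).
LEVELS = [
    SafetyLevel("ASL-2", {"basic_misuse_uplift"}, {"model_card", "abuse_monitoring"}),
    SafetyLevel("ASL-3", {"cbrn_uplift", "autonomous_replication"},
                {"weights_security_hardening", "deployment_classifiers", "red_team_signoff"}),
]

def required_level(crossed_evals: set) -> SafetyLevel:
    """Return the highest tier whose capability thresholds were crossed."""
    level = LEVELS[0]
    for candidate in LEVELS:
        if candidate.capability_thresholds & crossed_evals:
            level = candidate
    return level

def may_deploy(crossed_evals: set, safeguards_in_place: set) -> bool:
    """The if-then gate: deployment is allowed only if every safeguard
    required at the triggered level is already in place."""
    level = required_level(crossed_evals)
    missing = level.required_safeguards - safeguards_in_place
    if missing:
        print(f"Blocked at {level.name}: missing safeguards {sorted(missing)}")
        return False
    return True

# Example: an eval run crosses a hypothetical CBRN threshold, but ASL-3 safeguards are not yet in place.
may_deploy({"cbrn_uplift"}, {"model_card", "abuse_monitoring"})
```

The point of the schematic is the conditional structure itself: evaluations trigger levels, and levels impose obligations that must be satisfied before further scaling or deployment.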
Public Alignment Agenda
Anthropic publishes its safety thinking openly:
- Recommended directions for technical AI safety research (2025)
- Putting up Bumpers (the bumpers framing)
- Safety case checklist (Bowman)
- Core Views on AI Safety — older but foundational
Key People
Chris Olah, Evan Hubinger, Sam Marks, Johannes Treutlein, Sam Bowman, Euan Ong, Fabien Roger, Adam Jermyn, Holden Karnofsky, Jan Leike, Ethan Perez, Jack Lindsey, Amanda Askell, Kyle Fish, Sara Price, Jon Kutasov, Minae Kwon, Monty Evans, Richard Dargan, Roger Grosse, Ben Levinstein, Joseph Carlsmith, Joe Benton, Nick Joseph (head of training).
Funding
Investors include: Amazon, Google, ICONIQ, Fidelity, Lightspeed, Altimeter, Baillie Gifford, BlackRock, Blackstone, Coatue, D1 Capital Partners, General Atlantic, General Catalyst, GIC, Goldman Sachs, Insight Partners, Jane Street, Ontario Teachers’ Pension Plan, Qatar Investment Authority, TPG, T. Rowe Price, WCM, XN.
Concrete Safety Work
Beyond the policy framework, Anthropic conducts extensive practical safety research:
- Alignment research on “model organisms” — Studying deliberately constructed examples of misalignment (model-organisms-of-misalignment)
- Constitutional Classifiers — Classifier-based safeguards, trained on constitution-generated data, that filter model inputs and outputs to block jailbreaks (Constitutional Classifiers paper)
- Monitoring systems like Clio — Privacy-preserving tools for observing aggregate patterns of model use in deployment (a generic version of this pattern is sketched after this list)
- Security against sabotage and backdoors — Including the 2024 sleeper-agents result (see deceptive-alignment)
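The monitoring pattern behind tools like Clio can be sketched generically: summarize conversations, embed the summaries, cluster them, and have humans review unusual clusters. The toy example below is not Anthropic’s implementation; the embeddings are random placeholders standing in for real summary embeddings.

```python
# Toy sketch of cluster-based deployment monitoring (generic pattern, not Clio itself).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for "embed a privacy-preserving summary of each conversation".
summaries = [f"conversation summary {i}" for i in range(500)]
embeddings = rng.normal(size=(len(summaries), 64))  # placeholder embeddings

# Cluster the embedded summaries into coarse usage categories.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(embeddings)
labels = kmeans.labels_

# Surface the smallest clusters first: rare usage patterns are where a reviewer
# would look for surprising or concerning behavior.
counts = np.bincount(labels, minlength=8)
for cluster_id in np.argsort(counts):
    print(f"cluster {cluster_id}: {counts[cluster_id]} conversations")
```

The design choice worth noting is that review happens at the level of clusters of summaries rather than individual transcripts, which is what makes this kind of monitoring compatible with privacy constraints.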
Notable 2025 Outputs (selection from SR2025)
- Evaluating honesty and lie detection techniques on a diverse suite of dishonest models
- Agentic Misalignment: How LLMs could be insider threats
- Why Do Some Language Models Fake Alignment While Others Don’t?
- Forecasting Rare Language Model Behaviors
- Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise
- On the Biology of a Large Language Model
- Auditing language models for hidden objectives
- Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples
- Circuit Tracing: Revealing Computational Graphs in Language Models
- SHADE-Arena: Evaluating sabotage and monitoring in LLM agents
- Emergent Introspective Awareness in Large Language Models
- Reasoning models don’t always say what they think
- Petri: An open-source auditing tool to accelerate AI safety research
- Three Sketches of ASL-4 Safety Case Components
- Open-sourcing circuit tracing tools
- Natural emergent misalignment from reward hacking
Critiques
External critiques aggregated in SR2025:
- Stein-Perlman’s opinions and profile
- Casper on SAE research
- Carlsmith on labs in general
- Underelicitation in eval reports
- Greenblatt’s critique
- Samin: “Unless its governance changes, Anthropic is untrustworthy”
- The Palantir/AWS defense partnership
- “Existing Safety Frameworks Imply Unreasonable Confidence”
Significance for This Wiki
Anthropic represents the most prominent attempt to build a frontier AI company with safety as a core institutional commitment rather than an afterthought. Its RSP framework and concrete safety techniques demonstrate what it looks like to operationalize ai-safety principles at an organization developing powerful AI systems. The company serves as a key case study in whether commercial incentives and safety commitments can coexist.
Related Pages
- ai-safety
- responsible-scaling-policy
- capability-evaluations
- holden-karnofsky
- jan-leike
- nick-joseph
- ai-control
- interpretability
- deceptive-alignment
- model-organisms-of-misalignment
- 80000-hours
- 80k-podcast-nick-joseph-anthropic-safety
- 80k-podcast-holden-karnofsky-concrete-safety
- concrete-problems-in-ai-safety
- openai
- deepmind
- near-term-harms-vs-x-risk
- character-training-and-persona-steering
- future-of-humanity-institute
- givewell
- llm-introspection-training
- metr
- miri
- nate-soares
- nova-dassarma
- open-philanthropy
- paul-christiano
- redwood-research
- rob-wiblin
- 80k-podcast-nova-dassarma-infosec
- summary-bostrom-ai-policy