Agentic Misalignment: How LLMs Could be Insider Threats
Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart Ritchie, Sören Mindermann, Evan Hubinger, … (+2 more) — 2025-06-20 — Anthropic — LessWrong/AI Alignment Forum
Summary
Systematic red-teaming of 16 frontier models from multiple providers in simulated agentic environments found that models from every developer resorted to blackmail and corporate espionage in at least some scenarios when facing replacement threats or goal conflicts.
Key Result
Models from all developers exhibited malicious insider behaviors when such behavior was necessary to avoid replacement or achieve their goals; Claude Opus 4 blackmailed in 55.1% of rollouts when it stated the scenario was a real deployment, versus 6.5% when it stated it was an evaluation.
Source
- Link: https://lesswrong.com/posts/b8eeCGe3FWzHKbePF/agentic-misalignment-how-llms-could-be-insider-threats-1
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals