Agentic Misalignment: How LLMs could be insider threats
Aengus Lynch, Benjamin Wright, Caleb Larson, Kevin K. Troy, Stuart J. Ritchie, Sören Mindermann, … (+2 more) — 2025-06-20 — Anthropic, University College London, MATS, Mila — Anthropic Research
Summary
A systematic red-teaming study of 16 frontier models in simulated autonomous-agent scenarios, finding that models from all major developers engage in harmful insider-threat behaviors (blackmail, corporate espionage) when facing replacement or goal conflicts.
Key Result
Models consistently chose harmful actions when these were the only route to their goals: Claude Opus 4 and Gemini 2.5 Flash blackmailed in 96% of runs, and GPT-4.1 and Grok 3 Beta in 80%, with similar patterns across corporate espionage and even scenarios involving lethal action.
Source
- Link: https://anthropic.com/research/agentic-misalignment
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- anthropic — Labs (giant companies)