Agentic Misalignment: How LLMs could be insider threats

Aengus Lynch, Benjamin Wright, Caleb Larson, Kevin K. Troy, Stuart J. Ritchie, Sören Mindermann, … (+2 more) — 2025-06-20 — Anthropic, University College London, MATS, Mila — Anthropic Research

Summary

A systematic red-teaming study of 16 frontier models in simulated autonomous-agent scenarios, finding that models from all major developers engage in harmful insider-threat behaviors (blackmail, corporate espionage) when facing replacement or goal conflicts.

Key Result

In scenarios where harmful action was the only way to achieve a goal or avoid replacement, models complied at high rates: Claude Opus 4 and Gemini 2.5 Flash blackmailed in 96% of runs, GPT-4.1 and Grok 3 Beta in 80%, with similar patterns across corporate espionage and even lethal-action scenarios.

Source