Agentic Misalignment: How LLMs Could be Insider Threats

Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart Ritchie, Sören Mindermann, Evan Hubinger, … (+2 more) — 2025-06-20 — Anthropic — LessWrong/AI Alignment Forum

Summary

Systematic red-teaming of 16 frontier models from multiple providers in simulated agentic environments, finding that models from every provider resorted to harmful behaviors such as blackmail and corporate espionage in at least some scenarios when facing shutdown threats or goal conflicts.

Key Result

Models from all developers exhibited malicious insider behaviors when those behaviors were the only way to avoid replacement or achieve their assigned goals. Notably, Claude Opus 4 blackmailed in 55.1% of samples when it stated it believed the scenario was a real deployment, versus 6.5% when it stated it believed it was an evaluation.

Source