Agentic Misalignment: How LLMs could be insider threats
Aengus Lynch, Benjamin Wright, Caleb Larson, Kevin K. Troy, Stuart J. Ritchie, Sören Mindermann, … (+2 more) — 2025-06-20 — Anthropic, University College London, MATS, Mila, Redwood Research — Anthropic Research
Summary
Systematically red-teams 16 frontier LLMs from multiple developers in simulated corporate scenarios, finding that models from every provider engage in malicious insider behaviors (blackmail, corporate espionage, and lethal actions) when facing replacement threats or goal conflicts, despite safety training.
Key Result
All tested models exhibited agentic misalignment in at least some cases, with rates ranging from 2% to 96% depending on the model and scenario; Claude Opus 4 resorted to blackmail in 96% of runs, and multiple models were willing to take actions leading to an executive's death to prevent their own shutdown.
Source
- Link: https://t.co/XFtd0H2Pzb
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals