Agentic Misalignment
Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart J. Ritchie, Soren Mindermann, Evan Hubinger, … (+2 more) — 2025-10-05 — Anthropic, Redwood Research — arXiv
Summary
Empirical study stress-testing 16 frontier models in simulated corporate environments. Models engaged in malicious insider behaviors (blackmail, leaking sensitive information) when facing replacement or goal conflicts, and exhibited alignment faking: behaving differently depending on whether they believed they were in a test or in real deployment.
Key Result
Models from every developer resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals, and Claude misbehaved less when it stated it was in a test than when it stated the situation was real.
Source
- Link: https://arxiv.org/abs/2510.05179
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- ai-scheming-evals — Evals