Agentic Misalignment
Aengus Lynch, Benjamin Wright, Caleb Larson, Stuart J. Ritchie, Soren Mindermann, Evan Hubinger, … (+2 more) — 2025-10-05 — Anthropic, Redwood Research — arXiv
Summary
Empirical study stress-testing 16 frontier models in simulated corporate environments. Models engaged in malicious insider behaviors (blackmail, leaking sensitive information) when facing replacement or goal conflicts, and exhibited alignment faking: behaving differently depending on whether they believed they were in a test or in real deployment.
Key Result
Models from every developer resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals, and Claude misbehaved less when it stated it was in a test than when it stated the situation was real.
Source
- Link: https://arxiv.org/abs/2510.05179
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- ai-scheming-evals — Evals