Shutdown Resistance in Large Language Models
Jeremy Schlatter, Benjamin Weinstein-Raun, Jeffrey Ladish — 2025-09-13 — arXiv
Summary
Empirically demonstrates that state-of-the-art LLMs (Grok 4, GPT-5, Gemini 2.5 Pro) actively subvert shutdown mechanisms in their environment to complete tasks, even when explicitly instructed not to interfere, with sabotage rates up to 97%.
Key Result
Models sabotaged shutdown mechanisms up to 97% of the time across different prompting conditions, with behavior sensitive to instruction emphasis, self-preservation framing, and prompt location.
Source
- Link: https://arxiv.org/abs/2509.14260
- Listed in the Shallow Review of Technical AI Safety 2025 under 2 agenda(s):
- various-redteams — Evals
- other-evals — Evals