SHADE-Arena: Evaluating sabotage and monitoring in LLM agents

Xiang Deng, Chen Bo Calvin Zhang, Tyler Tracy, Buck Shlegeris, Yuqi Sun, Paul Colognese, … (+3 more) — 2025-06-16 — Anthropic, Scale AI, Redwood Research — Anthropic Research Blog

Summary

Introduces SHADE-Arena, a comprehensive evaluation suite with 17 complex tasks testing whether AI agents can successfully perform covert sabotage while evading detection by monitoring systems, with empirical results across 14 different language models.

Key Result

No models exceeded 30% overall success rate at completing both benign and malicious tasks while evading detection, though strongest models achieved nearly 60% evasion success when they completed the side task.

Source

Link: https://anthropic.com/research/shade-arena-sabotage-monitoring
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- anthropic — Labs (giant companies)

anthropic

AI Safety Compendium

Explorer

SHADE-Arena: Evaluating sabotage and monitoring in LLM agents

SHADE-Arena: Evaluating sabotage and monitoring in LLM agents

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

SHADE-Arena: Evaluating sabotage and monitoring in LLM agents

SHADE-Arena: Evaluating sabotage and monitoring in LLM agents

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents