SHADE-Arena: Evaluating sabotage and monitoring in LLM agents
Xiang Deng, Chen Bo Calvin Zhang, Tyler Tracy, Buck Shlegeris, Yuqi Sun, Paul Colognese, … (+3 more) — 2025-06-16 — Anthropic, Scale AI, Redwood Research — Anthropic Research Blog
Summary
Introduces SHADE-Arena, a comprehensive evaluation suite with 17 complex tasks testing whether AI agents can successfully perform covert sabotage while evading detection by monitoring systems, with empirical results across 14 different language models.
Key Result
No models exceeded 30% overall success rate at completing both benign and malicious tasks while evading detection, though strongest models achieved nearly 60% evasion success when they completed the side task.
Source
- Link: https://anthropic.com/research/shade-arena-sabotage-monitoring
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- anthropic — Labs (giant companies)