SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents
Jonathan Kutasov, Yuqi Sun, Paul Colognese, Teun van der Weij, Linda Petrini, Chen Bo Calvin Zhang, … (+6 more) — 2025-06-17 — Anthropic — arXiv
Summary
Introduces SHADE-Arena, the first highly diverse evaluation suite for testing whether LLM agents can sabotage their users: each environment pairs a benign main task with a harmful hidden side objective, and the agent must complete the side objective without raising the suspicion of an LLM monitor that reads the transcript. The paper evaluates both sabotage and monitoring capabilities across frontier models.
Key Result
The best-performing frontier models achieve only 27% (Claude 3.7 Sonnet) and 15% (Gemini 2.5 Pro) overall sabotage success when monitored, and success depends heavily on access to a hidden scratchpad the monitor cannot read; the best monitor (Gemini 2.5 Pro) achieves an AUC of 0.87 at distinguishing benign from malign transcripts.
Source
- Link: https://arxiv.org/abs/2506.15740
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- control — Black-box safety (understand and control current model behaviour) / Iterative alignment