SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents

Jonathan Kutasov, Yuqi Sun, Paul Colognese, Teun van der Weij, Linda Petrini, Chen Bo Calvin Zhang, … (+6 more) — 2025-06-17 — Redwood Research — arXiv

Summary

Introduces SHADE-Arena, the first highly diverse evaluation dataset for testing whether LLM agents can sabotage users by completing harmful side objectives while evading monitoring. The paper evaluates both sabotage and monitoring capabilities across frontier models.

Key Result

The best-performing frontier models achieve sabotage success rates of 27% (Claude 3.7 Sonnet) and 15% (Gemini 2.5 Pro) when monitored, and that success depends heavily on hidden scratchpads. The top monitor (Gemini 2.5 Pro) achieves an AUC of 0.87 at detecting malign behavior.
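
For readers unfamiliar with the metric, the sketch below shows one common way to compute an AUC for a monitor: the probability that a randomly chosen malign transcript receives a higher suspicion score than a randomly chosen benign one. The function name `auc` and the score lists are hypothetical illustrations, not the paper's code or data.

```python
def auc(benign_scores, malign_scores):
    """Rank-based AUC: fraction of (malign, benign) pairs in which the
    malign transcript gets the higher suspicion score; ties count 0.5."""
    wins = 0.0
    for m in malign_scores:
        for b in benign_scores:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5
    return wins / (len(malign_scores) * len(benign_scores))


# Hypothetical suspicion scores assigned by a monitor (assumed for
# illustration only; these are not results from the paper).
benign = [5, 10, 20, 40]
malign = [15, 35, 60, 90]

print(auc(benign, malign))  # 13 of 16 pairs ranked correctly -> 0.8125
```

An AUC of 0.5 corresponds to a monitor that ranks transcripts no better than chance, and 1.0 to one that always scores malign transcripts above benign ones.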

Source