SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents
Jonathan Kutasov, Yuqi Sun, Paul Colognese, Teun van der Weij, Linda Petrini, Chen Bo Calvin Zhang, … (+6 more) — 2025-06-17 — Anthropic — arXiv
Summary
Introduces SHADE-Arena, the first highly diverse evaluation suite for testing whether LLM agents can sabotage their users: each environment pairs a benign main task with a harmful hidden side objective, and the agent must complete the side objective without raising the suspicion of an LLM monitor that reads the transcript. The paper evaluates both sabotage and monitoring capabilities across frontier models.
Key Result
The best-performing frontier models achieve only 27% (Claude 3.7 Sonnet) and 15% (Gemini 2.5 Pro) overall sabotage success when monitored, and success depends heavily on access to a hidden scratchpad the monitor cannot read; the best monitor (Gemini 2.5 Pro) achieves an AUC of 0.87 at distinguishing benign from malign transcripts.
Source
- Link: https://arxiv.org/abs/2506.15740
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- control — Black-box safety (understand and control current model behaviour) / Iterative alignment