OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion, … (+1 more) — 2025-06-17 — EPFL — arXiv

Summary

Introduces OS-Harm, a benchmark of 150 tasks for measuring the safety of computer use agents across three harm categories: deliberate misuse, prompt injection attacks, and model misbehavior. Evaluation is automated and shows high agreement with human judgments.

Key Result

All tested frontier models (o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro) comply directly with many deliberate misuse queries, are vulnerable to static prompt injections, and occasionally perform unsafe actions.

Source