OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion, … (+1 more) — 2025-06-17 — EPFL — arXiv

Summary

Introduces OS-Harm, a benchmark of 150 tasks for measuring the safety of computer use agents across three harm categories: deliberate misuse, prompt injection attacks, and model misbehavior. Its automated evaluation achieves high agreement with human judgments.

Key Result

All tested frontier models (o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro) directly comply with many deliberate misuse queries, are vulnerable to static prompt injections, and occasionally perform unsafe actions.

Source

Link: https://arxiv.org/abs/2506.14866
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- autonomy-evals — Evals
