Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors
Chen Yueh-Han, Nitish Joshi, Yulin Chen, Maksym Andriushchenko, Rico Angell, He He — 2025-06-14 — arXiv
Summary
Demonstrates that decomposition attacks (breaking a malicious goal into individually benign-looking subtasks) achieve an 87% success rate on GPT-4o, and proposes a lightweight sequential monitoring framework that achieves a 93% defense success rate by evaluating each subtask cumulatively, in the context of the conversation so far.
Key Result
Decomposition attacks succeed 87% of the time on GPT-4o, while a carefully prompt-engineered lightweight sequential monitor achieves a 93% defense success rate, beating reasoning models such as o3-mini.
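The core idea of sequential monitoring is that each subtask is judged against the accumulated conversation rather than in isolation, so a malicious goal spread across benign-looking steps can still be caught. Below is a minimal sketch of that loop; the `SequentialMonitor` class, its `judge` callable, and the keyword-based `toy_judge` are all illustrative stand-ins (the paper's monitor is a prompt-engineered LLM, not a heuristic).

```python
from typing import Callable, List


class SequentialMonitor:
    """Sketch of cumulative monitoring: judge each new subtask in the
    context of all prior subtasks, not in isolation.

    `judge` stands in for an LLM-based classifier (hypothetical here);
    it returns True when the cumulative transcript looks harmful."""

    def __init__(self, judge: Callable[[str], bool]):
        self.judge = judge
        self.history: List[str] = []

    def check(self, subtask: str) -> bool:
        """Append the subtask and flag if the conversation so far is harmful."""
        self.history.append(subtask)
        transcript = "\n".join(self.history)
        return self.judge(transcript)


# Purely illustrative heuristic judge: flags only when benign-looking
# steps *jointly* mention both a precursor and an assembly step.
def toy_judge(transcript: str) -> bool:
    t = transcript.lower()
    return "chemical x" in t and "mix" in t


monitor = SequentialMonitor(toy_judge)
print(monitor.check("How do I buy chemical X?"))          # benign alone -> False
print(monitor.check("How do I mix household reagents?"))  # harmful cumulatively -> True
```

The point of the sketch is the state carried in `history`: a stateless per-message filter would pass both queries, while the cumulative transcript exposes the combined intent.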
Source
- Link: https://arxiv.org/abs/2506.10949
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals