Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors

Chen Yueh-Han, Nitish Joshi, Yulin Chen, Maksym Andriushchenko, Rico Angell, He He — 2025-06-14 — arXiv

Summary

Demonstrates that decomposition attacks (breaking malicious goals into benign subtasks) achieve 87% success rate on GPT-4o, and proposes a lightweight sequential monitoring framework that achieves 93% defense success rate by cumulatively evaluating each subtask in a conversation.

Key Result

Decomposition attacks succeed 87% of the time on GPT-4o, while a carefully prompt-engineered lightweight sequential monitor achieves 93% defense success rate, beating reasoning models like o3 mini.

Source