Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors
Chen Yueh-Han, Nitish Joshi, Yulin Chen, Maksym Andriushchenko, Rico Angell, He He — 2025-06-14 — arXiv
Summary
Demonstrates that decomposition attacks (breaking a malicious goal into individually benign-looking subtasks) achieve an 87% success rate on GPT-4o, and proposes a lightweight sequential monitoring framework that achieves a 93% defense success rate by evaluating each subtask cumulatively, in the context of the conversation so far.
Key Result
Decomposition attacks succeed 87% of the time on GPT-4o, while a carefully prompt-engineered lightweight sequential monitor achieves a 93% defense success rate, beating reasoning models such as o3-mini.
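The core idea of sequential monitoring is that each subtask is judged against the accumulated conversation rather than in isolation, so a malicious goal spread across benign-looking steps can still be caught. Below is a minimal sketch of that loop; the `SequentialMonitor` class, its `judge` callable, and the keyword-based `toy_judge` are all illustrative stand-ins (the paper's monitor is a prompt-engineered LLM, not a heuristic).

```python
from typing import Callable, List


class SequentialMonitor:
    """Sketch of cumulative monitoring: judge each new subtask in the
    context of all prior subtasks, not in isolation.

    `judge` stands in for an LLM-based classifier (hypothetical here);
    it returns True when the cumulative transcript looks harmful."""

    def __init__(self, judge: Callable[[str], bool]):
        self.judge = judge
        self.history: List[str] = []

    def check(self, subtask: str) -> bool:
        """Append the subtask and flag if the conversation so far is harmful."""
        self.history.append(subtask)
        transcript = "\n".join(self.history)
        return self.judge(transcript)


# Purely illustrative heuristic judge: flags only when benign-looking
# steps *jointly* mention both a precursor and an assembly step.
def toy_judge(transcript: str) -> bool:
    t = transcript.lower()
    return "chemical x" in t and "mix" in t


monitor = SequentialMonitor(toy_judge)
print(monitor.check("How do I buy chemical X?"))          # benign alone -> False
print(monitor.check("How do I mix household reagents?"))  # harmful cumulatively -> True
```

The point of the sketch is the state carried in `history`: a stateless per-message filter would pass both queries, while the cumulative transcript exposes the combined intent.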
Source
- Link: https://arxiv.org/abs/2506.10949
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals