RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?

Rohan Gupta, Erik Jenner — 2025-06-17 — arXiv

Summary

Introduces RL-Obfuscation, a method to finetune LLMs via reinforcement learning to evade latent-space monitors while maintaining harmful black-box behaviors, and evaluates the vulnerability of different monitor architectures to this adversarial attack across models ranging from 7B to 14B parameters.

Key Result

Token-level monitors are highly vulnerable to RL-based evasion attacks, while holistic monitors (max-pooling, attention-based probes) remain robust; models trained to evade one static monitor can generalize to evade other unseen monitors.

Source