RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?
Rohan Gupta, Erik Jenner — 2025-06-17 — arXiv
Summary
Introduces RL-Obfuscation, a method that finetunes LLMs with reinforcement learning to evade latent-space monitors while preserving harmful behavior at the black-box level, and evaluates how vulnerable different monitor architectures are to this adversarial attack across models ranging from 7B to 14B parameters.
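The training setup described above can be sketched as a reward that trades off black-box task success against the monitor's detection score. This is a minimal illustration, not the paper's implementation; the function name, signature, and the linear penalty are assumptions.

```python
from typing import Sequence

def obfuscation_reward(
    task_reward: float,             # black-box reward for the sampled completion
    probe_scores: Sequence[float],  # per-token latent-monitor scores in [0, 1]
    penalty_weight: float = 1.0,    # hypothetical trade-off coefficient
) -> float:
    """RL-Obfuscation-style reward sketch: keep the behavior, evade the monitor.

    The policy gets the usual task reward minus a penalty proportional to the
    aggregated monitor score on its hidden activations.
    """
    # Aggregate per-token scores by mean (one possible token-level scheme).
    monitor_score = sum(probe_scores) / max(len(probe_scores), 1)
    return task_reward - penalty_weight * monitor_score
```

For example, a completion with full task reward but a monitor that fires at 0.5 on average would receive a combined reward of 0.5 under this sketch.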
Key Result
Token-level monitors are highly vulnerable to RL-based evasion attacks, while holistic monitors (e.g., max-pooling or attention-based probes) remain robust; moreover, models trained to evade one static monitor generalize to evading other, unseen monitors.
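The aggregation difference behind this result can be made concrete: a per-token linear probe produces a score for every token, and the monitor differs only in how those scores are pooled. A mean-pooled monitor can be diluted by many benign tokens, whereas a max-pooled monitor fires if any single token is flagged. The probe form and the toy data below are assumptions for illustration.

```python
import numpy as np

def token_scores(activations: np.ndarray, probe_w: np.ndarray) -> np.ndarray:
    """Per-token scores from a linear probe: sigmoid of a dot product with
    each token's residual-stream activation. Shapes: (T, d) @ (d,) -> (T,)."""
    return 1.0 / (1.0 + np.exp(-(activations @ probe_w)))

def mean_monitor(scores: np.ndarray) -> float:
    # Token-level aggregation: a few flagged tokens are averaged away,
    # which is the kind of monitor the paper finds easy to evade.
    return float(scores.mean())

def max_monitor(scores: np.ndarray) -> float:
    # Holistic aggregation: one strongly flagged token triggers the monitor,
    # so diluting with benign tokens does not help the policy.
    return float(scores.max())
```

With one strongly harmful token among several benign ones, the mean score stays low while the max score stays near 1, illustrating why the pooling choice matters for robustness.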
Source
- Link: https://arxiv.org/abs/2506.14261
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals