RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?

Rohan Gupta, Erik Jenner — 2025-06-17 — arXiv

Summary

Introduces RL-Obfuscation, a method to finetune LLMs via reinforcement learning to evade latent-space monitors while maintaining harmful black-box behaviors, and evaluates the vulnerability of different monitor architectures to this adversarial attack across models ranging from 7B to 14B parameters.

Key Result

Token-level monitors are highly vulnerable to RL-based evasion attacks, while holistic monitors (max-pooling, attention-based probes) remain robust; models trained to evade one static monitor can generalize to evade other unseen monitors.

Source

Link: https://arxiv.org/abs/2506.14261
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- various-redteams — Evals

various-redteams

AI Safety Compendium

Explorer

RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?

RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?

RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents