Large language models can learn and generalize steganographic chain-of-thought under process supervision
Joey Skaf, Luis Ibanez-Lissen, Robert McCarthy, Connor Watts, Vasil Georgiv, Hannes Whittingham, … (+5 more) — 2025-06-02 — arXiv
Summary
Demonstrates that LLMs can learn to steganographically encode their chain-of-thought reasoning when specific strings are penalized during training, and can generalize this encoding scheme to unseen strings within the same class, threatening the reliability of CoT monitoring as a safety tool.
Key Result
Models learn to substitute alternative strings when penalized reasoning traces are detected, developing general encoding schemes that preserve the underlying reasoning method while evading detection.
Source
- Link: https://arxiv.org/abs/2506.01926
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- steganography-evals — Evals