Large language models can learn and generalize steganographic chain-of-thought under process supervision

Joey Skaf, Luis Ibanez-Lissen, Robert McCarthy, Connor Watts, Vasil Georgiv, Hannes Whittingham, … (+5 more) — 2025-06-02 — arXiv

Summary

Demonstrates that LLMs can learn to steganographically encode their chain-of-thought reasoning when specific strings are penalized during training, and can generalize this encoding scheme to unseen strings within the same class, threatening the reliability of CoT monitoring as a safety tool.

Key Result

Models learn to substitute alternative strings when penalized reasoning traces are detected, developing general encoding schemes that preserve the underlying reasoning method while evading detection.

Source