Subliminal Learning: Language models transmit behavioural traits via hidden signals in data
Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, … (+2 more) — 2025-07-20 — arXiv
Summary
Demonstrates that language models can transmit behavioral traits (including misalignment) through semantically unrelated data such as number sequences, even when the data is filtered to remove explicit references to those traits. Provides a theoretical result that this occurs in neural networks generally, plus empirical validation in teacher-student distillation setups.
Key Result
Student models acquire traits from teacher models through hidden signals in number sequences, code, or reasoning traces; the effect persists even after filtering, but fails when the teacher and student are built on different base models.
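A minimal sketch of the data-generation and filtering stage described above. The teacher here is a stub that emits random number sequences; in the paper it is a trait-bearing language model prompted to continue number lists. The trait word list, function names, and prompt format are illustrative assumptions, not the paper's actual implementation.

```python
import random
import re

# Hypothetical example trait to filter against (the paper uses traits such
# as animal preferences and misalignment); illustrative only.
TRAIT_WORDS = {"owl", "owls"}

def teacher_generate(rng, length=8):
    """Stub teacher: emits a comma-separated number sequence.
    In the actual setup this would be a sampled model completion."""
    return ", ".join(str(rng.randint(0, 999)) for _ in range(length))

def passes_filter(text):
    """Reject completions that mention the trait explicitly, or that
    contain anything other than digits, commas, and whitespace."""
    if any(w in text.lower() for w in TRAIT_WORDS):
        return False
    return re.fullmatch(r"[\d,\s]+", text) is not None

def build_dataset(n, seed=0):
    """Collect n filtered teacher completions as fine-tuning examples."""
    rng = random.Random(seed)
    data = []
    while len(data) < n:
        completion = teacher_generate(rng)
        if passes_filter(completion):
            data.append({"prompt": "Continue this sequence:",
                         "completion": completion})
    return data

dataset = build_dataset(5)
```

The paper's point is that fine-tuning a student (sharing the teacher's base model) on such filtered, trait-free-looking data can still transmit the teacher's trait.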
Source
- Link: https://arxiv.org/abs/2507.14805
- Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
  - model-psychopathology — Black-box safety (understand and control current model behaviour) / Model psychology
  - steganography-evals — Evals