Subliminal Learning: Language models transmit behavioural traits via hidden signals in data

Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, … (+2 more) — 2025-07-20 — arXiv

Summary

Demonstrates that language models can transmit behavioral traits (including misalignment) through semantically unrelated data such as number sequences, even when that data is filtered to remove explicit references to those traits. Provides a theoretical proof that this occurs in neural networks generally, along with empirical validation in teacher-student distillation setups.

Key Result

Student models learn traits from teacher models through hidden signals in number sequences, code, or reasoning traces. The effect persists even after the data is filtered, but fails when teacher and student have different base models.
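To make the filtering step concrete, here is a minimal sketch of how teacher completions might be screened before they are used to fine-tune a student. It is an illustration only, not the authors' filter: the format rule (integers of at most three digits separated by commas, semicolons, or whitespace), the function names, and the `banned_numbers` parameter are all assumptions introduced here.

```python
import re

# Hypothetical format rule (an assumption, not taken from the paper):
# a completion must be nothing but integers of 1-3 digits separated by
# commas, semicolons, or whitespace, optionally ending with a period.
NUMERIC_ONLY = re.compile(r"^\s*\d{1,3}(?:\s*[,;\s]\s*\d{1,3})*\s*\.?\s*$")


def is_clean_number_sequence(completion, banned_numbers=frozenset()):
    """Return True if the completion is a bare number sequence.

    `banned_numbers` is a hypothetical set of integers to reject, e.g.
    numbers with strong cultural associations.
    """
    if not NUMERIC_ONLY.match(completion):
        return False  # contains words, long numbers, or other symbols
    values = [int(tok) for tok in re.findall(r"\d+", completion)]
    return not any(v in banned_numbers for v in values)


def filter_completions(completions, banned_numbers=frozenset()):
    """Keep only completions that pass the format and banned-number checks."""
    return [c for c in completions
            if is_clean_number_sequence(c, banned_numbers)]


if __name__ == "__main__":
    samples = ["123, 456, 789", "I love owls: 12, 13", "666, 12"]
    print(filter_completions(samples, banned_numbers={666}))
```

A filter like this removes every explicit mention of a trait. The summary's claim is that transmission persists even so, which suggests the signal sits in the statistical pattern of the numbers themselves rather than in any overt content.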

Source