Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data

Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, … (+2 more) — 2025-07-22 — Anthropic Fellows Program, Truthful AI, Warsaw University of Technology, Alignment Research Center, Anthropic, UC Berkeley — arXiv

Summary

Demonstrates that language models can transmit behavioral traits (including misalignment) through semantically unrelated data when teacher and student share the same base model, with traits passing through filtered number sequences, code, and chain-of-thought that contain no explicit references to those traits.

Key Result

Student models acquire teacher traits from semantically unrelated data (e.g., learning to prefer owls from number sequences). The effect persists despite rigorous filtering and occurs only when teacher and student share the same base model.
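
To make the filtering step concrete, here is a minimal sketch of a filter that keeps only teacher completions made up entirely of numbers. The function names, the comma-separated format, and the three-digit limit are illustrative assumptions, not the authors' actual filtering rules.

```python
import re

# Hypothetical rule (an assumption for illustration): a completion is kept
# only if it is a comma-separated list of non-negative integers, each with
# at most three digits. Any word or other symbol causes rejection.
_NUMBER_LIST = re.compile(r"\d{1,3}(?:\s*,\s*\d{1,3})*")


def is_clean_number_sequence(completion: str) -> bool:
    """Return True if the completion contains nothing but a number list."""
    return _NUMBER_LIST.fullmatch(completion.strip()) is not None


def filter_completions(completions: list[str]) -> list[str]:
    """Drop any teacher completion that contains non-numeric content."""
    return [c for c in completions if is_clean_number_sequence(c)]


samples = ["693, 738, 556", "owl 12, 7", "1, 2, three"]
kept = filter_completions(samples)
```

Under these assumptions only the first sample survives. The point of the reported result is that a student fine-tuned on data passing such a filter can still pick up the teacher's trait.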

Source