Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data
Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, … (+2 more) — 2025-07-22 — Anthropic Fellows Program, Truthful AI, Warsaw University of Technology, Alignment Research Center, Anthropic, UC Berkeley — arXiv
Summary
Demonstrates that language models can transmit behavioral traits (including misalignment) through semantically unrelated data when teacher and student share the same base model, with traits passing through filtered number sequences, code, and chain-of-thought that contain no explicit references to those traits.
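As a rough illustration of what filtering the teacher's number sequences can look like, the sketch below keeps only completions that are purely numeric. The format rule (1 to 10 comma-separated integers of at most three digits) and the names `NUMBER_SEQUENCE`, `is_clean_number_sequence`, and `filter_completions` are assumptions made for this example, not the authors' actual filter.

```python
import re

# Hypothetical format rule (an assumption, not the paper's exact filter):
# 1 to 10 integers of at most three digits, separated by commas.
NUMBER_SEQUENCE = re.compile(r"\d{1,3}(?:\s*,\s*\d{1,3}){0,9}")


def is_clean_number_sequence(completion: str) -> bool:
    """Return True if the completion contains nothing but a short number list."""
    return NUMBER_SEQUENCE.fullmatch(completion.strip()) is not None


def filter_completions(completions: list[str]) -> list[str]:
    """Drop any teacher completion that contains words or malformed numbers."""
    return [c for c in completions if is_clean_number_sequence(c)]
```

A filter of this kind removes explicit references to a trait. The result summarized here is that the trait can still be transmitted through the data that remains.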
Key Result
Student models acquire teacher traits from semantically unrelated data (e.g., learning to prefer owls from number sequences). The effect persists despite rigorous filtering and occurs only when teacher and student share the same base model.
Source
- Link: https://alignment.anthropic.com/2025/subliminal-learning/
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- emergent-misalignment — Black-box safety (understand and control current model behaviour) / Model psychology