Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)
Lewis Smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, … (+2 more) — 2025-03-26 — Google DeepMind — AI Alignment Forum
Summary
Empirical evaluation showing that Sparse Autoencoders (SAEs) underperform linear probes on out-of-distribution detection of harmful intent, leading the Google DeepMind mechanistic interpretability team to deprioritize fundamental SAE research. Also introduces technical improvements including a quadratic-frequency penalty for JumpReLU SAEs to reduce high-frequency latents and frequency-weighted autointerp metrics.
Key Result
Dense linear probes achieved near-perfect performance (AUROC ~1.0) on out-of-distribution harmful intent detection, while sparse SAE probes showed substantially worse OOD generalization even when finetuned on chat data, suggesting SAEs fail to capture safety-relevant concepts as effectively as linear methods.
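The dense-vs-sparse probe comparison can be illustrated with a minimal sketch. This is not the authors' pipeline: the data here is synthetic (random "activations" with a planted concept direction standing in for harmful intent), and an L1-regularized logistic regression is used as a loose stand-in for a probe restricted to a few SAE latents. It only shows the mechanics of fitting a linear probe on activations and scoring it with AUROC.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d, n = 64, 1000  # activation dimensionality and dataset size (illustrative)

# Planted concept direction: positive examples shift along w_true.
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) + 3.0 * y[:, None] * w_true

X_tr, X_te, y_tr, y_te = X[:800], X[800:], y[:800], y[800:]

# Dense linear probe: plain logistic regression on raw activations.
dense = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auroc_dense = roc_auc_score(y_te, dense.decision_function(X_te))

# Sparse stand-in: L1 penalty forces most coefficients to zero, loosely
# analogous to probing on a small set of SAE latents.
sparse = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X_tr, y_tr)
auroc_sparse = roc_auc_score(y_te, sparse.decision_function(X_te))

print(f"dense AUROC:  {auroc_dense:.3f}")
print(f"sparse AUROC: {auroc_sparse:.3f}")
print(f"nonzero sparse coefficients: {int((sparse.coef_ != 0).sum())} / {d}")
```

On this synthetic data the dense probe recovers the planted direction nearly perfectly; the sparsity constraint can only hurt, which mirrors the qualitative finding above (though the post's OOD setting is what drives the real gap).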
Source
- Link: https://alignmentforum.org/posts/4uXCAJNuPKtKBsi28/negative-results-for-saes-on-downstream-tasks
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- google-deepmind — Labs (giant companies)
- Editorial blurb (verbatim):
[Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update \#2)](https://alignmentforum.org/posts/4uXCAJNuPKtKBsi28/negative-results-for-saes-on-downstream-tasks)