Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)
Lewis Smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, … (+2 more) — 2025-03-26 — Google DeepMind — AI Alignment Forum
Summary
Empirical evaluation showing that Sparse Autoencoders (SAEs) underperform linear probes on out-of-distribution detection of harmful intent, leading the Google DeepMind mechanistic interpretability team to deprioritize fundamental SAE research. Also introduces technical improvements including a quadratic-frequency penalty for JumpReLU SAEs to reduce high-frequency latents and frequency-weighted autointerp metrics.
Key Result
Dense linear probes achieved near-perfect performance (AUROC ~1.0) on out-of-distribution harmful intent detection, while sparse SAE probes showed substantially worse OOD generalization even when finetuned on chat data, suggesting SAEs fail to capture safety-relevant concepts as effectively as linear methods.
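The dense-vs-sparse probe comparison can be illustrated with a minimal sketch. This is not the authors' pipeline: the data here is synthetic (random "activations" with a planted concept direction standing in for harmful intent), and an L1-regularized logistic regression is used as a loose stand-in for a probe restricted to a few SAE latents. It only shows the mechanics of fitting a linear probe on activations and scoring it with AUROC.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d, n = 64, 1000  # activation dimensionality and dataset size (illustrative)

# Planted concept direction: positive examples shift along w_true.
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) + 3.0 * y[:, None] * w_true

X_tr, X_te, y_tr, y_te = X[:800], X[800:], y[:800], y[800:]

# Dense linear probe: plain logistic regression on raw activations.
dense = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auroc_dense = roc_auc_score(y_te, dense.decision_function(X_te))

# Sparse stand-in: L1 penalty forces most coefficients to zero, loosely
# analogous to probing on a small set of SAE latents.
sparse = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X_tr, y_tr)
auroc_sparse = roc_auc_score(y_te, sparse.decision_function(X_te))

print(f"dense AUROC:  {auroc_dense:.3f}")
print(f"sparse AUROC: {auroc_sparse:.3f}")
print(f"nonzero sparse coefficients: {int((sparse.coef_ != 0).sum())} / {d}")
```

On this synthetic data the dense probe recovers the planted direction nearly perfectly; the sparsity constraint can only hurt, which mirrors the qualitative finding above (though the post's OOD setting is what drives the real gap).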
Source
- Link: https://alignmentforum.org/posts/4uXCAJNuPKtKBsi28/negative-results-for-saes-on-downstream-tasks
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- google-deepmind — Labs (giant companies)
- Editorial blurb (verbatim):
[Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update \#2)](https://alignmentforum.org/posts/4uXCAJNuPKtKBsi28/negative-results-for-saes-on-downstream-tasks)