Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization

Or Shafran, Atticus Geiger, Mor Geva — 2025-06-12 — arXiv

Summary

Proposes semi-nonnegative matrix factorization (SNMF) as an alternative to sparse autoencoders (SAEs) for decomposing MLP activations into interpretable features. In experiments on Llama 3.1, Gemma 2, and GPT-2, the resulting features outperform SAEs and supervised baselines on causal steering tasks.
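To make the idea concrete, here is a minimal sketch of a generic semi-NMF, assuming the standard multiplicative-update scheme from the semi-NMF literature rather than the authors' exact training procedure. It factors a real-valued matrix X (imagined here as MLP activations, one column per token) into an unconstrained feature matrix F and a nonnegative coefficient matrix G, so that X ≈ F Gᵀ. The function name `semi_nmf` and its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def semi_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Generic semi-NMF sketch: X (d, n) ~= F @ G.T with G >= 0.

    F (d, k) is unconstrained (may hold negative entries);
    G (n, k) holds nonnegative per-column coefficients.
    Illustrative only, not the paper's implementation.
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    G = rng.random((n, k)) + 0.1  # strictly positive initialization

    def pos(M):
        return (np.abs(M) + M) / 2  # elementwise positive part

    def neg(M):
        return (np.abs(M) - M) / 2  # elementwise negative part (as a nonnegative matrix)

    for _ in range(n_iter):
        # F step: least-squares solution for F with G held fixed.
        F = X @ G @ np.linalg.pinv(G.T @ G)
        # G step: multiplicative update, which keeps G nonnegative.
        XtF = X.T @ F  # shape (n, k)
        FtF = F.T @ F  # shape (k, k)
        num = pos(XtF) + G @ neg(FtF)
        den = neg(XtF) + G @ pos(FtF) + eps
        G = G * np.sqrt(num / den)

    # Refit F so that it is optimal for the final G.
    F = X @ G @ np.linalg.pinv(G.T @ G)
    return F, G

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic stand-in for activations: mixed-sign features, nonnegative weights.
    X = rng.standard_normal((16, 5)) @ rng.random((5, 40))
    F, G = semi_nmf(X, k=5)
    print("relative error:", np.linalg.norm(X - F @ G.T) / np.linalg.norm(X))
```

In this sketch each column of F plays the role of a feature direction in activation space, and each row of G says how strongly the features contribute to one input column. Because G is nonnegative while F is not, features combine additively but can still point in any direction, which is the property that separates semi-NMF from ordinary NMF.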

Key Result

SNMF-derived features outperform SAEs and supervised baselines on causal steering. They also reveal hierarchical structure: semantically related features reuse the same neuron combinations.

Source

Link: https://arxiv.org/abs/2506.10920
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- sparse-coding — White-box safety (i.e. Interpretability)
