Discovering Undesired Rare Behaviors via Model Diff Amplification

Santiago Aranguri, Thomas McGrath — 2025-08-21 — Goodfire, NYU — Goodfire Research

Summary

Proposes model diff amplification, a method for surfacing rare undesired behaviors introduced by post-training. Sampling from logits that amplify the difference between the pre- and post-training models makes rare harmful behaviors 10-300x more common, and thus practical to detect.
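
As a rough illustration of amplifying logit differences, the sketch below assumes the amplified logits take the form logits_after + alpha * (logits_after - logits_before). The function names, the parameter `alpha`, and the toy three-token vocabulary are assumptions made for this example and are not taken from the authors' code.

```python
import numpy as np

def amplify_logits(logits_before, logits_after, alpha=1.0):
    """Push logits further in the direction post-training moved them.

    Assumed form: after + alpha * (after - before). alpha = 0 recovers
    the post-training model unchanged.
    """
    logits_before = np.asarray(logits_before, dtype=float)
    logits_after = np.asarray(logits_after, dtype=float)
    return logits_after + alpha * (logits_after - logits_before)

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Toy vocabulary of three tokens; index 2 stands in for a rare undesired token
# whose logit post-training nudged upward by a small amount.
before = np.array([2.0, 1.0, -4.0])
after = np.array([2.0, 1.0, -3.0])

p_after = softmax(after)
p_amplified = softmax(amplify_logits(before, after, alpha=3.0))

print(f"undesired token, post-training model: {p_after[2]:.4f}")
print(f"undesired token, amplified sampling:  {p_amplified[2]:.4f}")
```

In this toy case the only logit that changed during post-training is the one for the undesired token, so amplification raises its sampling probability while leaving the relative odds of the other two tokens untouched.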

Key Result

In emergent misalignment experiments where only 5% of the training data was contaminated, the method raised the rate of harmful behaviors from 1 in 10,000 rollouts to 1 in 30 rollouts.
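
To give a sense of what these rates mean for sampling cost, the sketch below estimates how many rollouts are needed to see at least one harmful output. It assumes rollouts are independent and uses a 95% confidence target; both the independence assumption and the 95% figure are choices made for this example, not values from the source.

```python
import math

def rollouts_needed(rate, confidence=0.95):
    """Smallest n such that 1 - (1 - rate)**n >= confidence.

    Assumes each rollout independently shows the behavior with
    probability `rate`.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - rate))

baseline = rollouts_needed(1 / 10_000)  # rate without amplification
amplified = rollouts_needed(1 / 30)     # rate with amplification

print(f"rollouts needed without amplification: {baseline}")
print(f"rollouts needed with amplification:    {amplified}")
```

Under these assumptions the sampling budget drops from tens of thousands of rollouts to under a hundred.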

Source