Discovering Undesired Rare Behaviors via Model Diff Amplification
Santiago Aranguri, Thomas McGrath — 2025-08-21 — Goodfire, NYU — Goodfire Research
Summary
Proposes model diff amplification, a method for detecting rare undesired behaviors introduced by post-training. It amplifies the logit differences between the pre- and post-training models at sampling time, which makes rare harmful behaviors 10-300x more common and thus practical to detect.
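A minimal sketch of the amplification step, for illustration only. The summary says only that logit differences between the two models are amplified; the specific form used here, `after + alpha * (after - before)`, the name `alpha`, and the toy three-token vocabulary are assumptions, not details confirmed by this entry.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def amplify_logits(logits_before, logits_after, alpha):
    """Push the post-training logits further along the direction of change.

    Assumed form: after + alpha * (after - before). With alpha = 0 this
    returns the post-training logits unchanged.
    """
    return [a + alpha * (a - b) for b, a in zip(logits_before, logits_after)]


# Toy vocabulary of three tokens. Token 2 stands in for a rare undesired
# continuation whose logit rose slightly during post-training.
logits_before = [2.0, 1.0, -6.0]
logits_after = [2.0, 1.0, -4.0]

p_after = softmax(logits_after)
p_amplified = softmax(amplify_logits(logits_before, logits_after, alpha=1.0))

print(f"P(rare token), post-training model: {p_after[2]:.5f}")
print(f"P(rare token), amplified sampling:  {p_amplified[2]:.5f}")
```

In this toy case only the rare token's logit changed during post-training, so amplification raises its probability while the relative order of the unchanged tokens stays the same. In practice the two logit vectors would come from the pre- and post-training models evaluated on the same prefix at each decoding step.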
Key Result
In emergent misalignment experiments with only 5% of the training data contaminated, the method raised the frequency of harmful behaviors from 1 in 10,000 rollouts to 1 in 30 rollouts.
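To see why this shift matters for a sampling budget, the sketch below estimates how many rollouts are needed to observe at least one harmful rollout with 95% probability at each of the two rates quoted above. It assumes rollouts are independent draws with a fixed per-rollout rate, which is a simplification; the function name and the 95% threshold are illustrative choices, not taken from the paper.

```python
import math


def rollouts_needed(rate, confidence=0.95):
    """Smallest n with P(at least one hit in n rollouts) >= confidence.

    Assumes independent rollouts, so P(no hit in n) = (1 - rate) ** n.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - rate))


rate_baseline = 1 / 10_000   # harmful rollouts without amplification
rate_amplified = 1 / 30      # harmful rollouts with amplification

n_baseline = rollouts_needed(rate_baseline)
n_amplified = rollouts_needed(rate_amplified)

print(f"Rollouts needed at 1 in 10,000: {n_baseline}")
print(f"Rollouts needed at 1 in 30:     {n_amplified}")
print(f"Fold increase in rate: {rate_amplified / rate_baseline:.0f}x")
```

Under these assumptions the required budget falls from roughly 30,000 rollouts to under 100, which is the sense in which amplification makes the behavior practical to detect.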
Source
- Link: https://www.goodfire.ai/papers/model-diff-amplification
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals