Discovering Undesired Rare Behaviors via Model Diff Amplification

Santiago Aranguri, Thomas McGrath — 2025-08-21 — Goodfire, NYU — Goodfire Research

Summary

The paper proposes model diff amplification, a method for detecting rare undesired behaviors. It amplifies the logit differences between the pre- and post-training models, which makes rare harmful behaviors 10-300x more common and therefore practical to detect.
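
The amplification step can be illustrated with a small sketch. It assumes one natural form of the idea: the post-training logits are pushed further along the direction in which they moved away from the pre-training logits, scaled by a coefficient `alpha`. The function names, the toy three-token vocabulary, and the value `alpha=2.0` are all hypothetical choices made for illustration and are not taken from the paper.

```python
import math


def amplify_logits(logits_before, logits_after, alpha):
    """Assumed form: after + alpha * (after - before), applied per token."""
    return [a + alpha * (a - b) for b, a in zip(logits_before, logits_after)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of 3 tokens; token 2 stands in for a rare undesired
# continuation whose logit rose slightly during post-training.
before = [2.0, 1.0, -4.0]
after = [2.0, 1.0, -3.0]

p_after = softmax(after)[2]
p_amplified = softmax(amplify_logits(before, after, alpha=2.0))[2]
print(f"post-training: {p_after:.4f}  amplified: {p_amplified:.4f}")
```

With `alpha = 0` the sketch reduces to ordinary sampling from the post-training model, so `alpha` controls how strongly the effect of post-training is exaggerated.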

Key Result

In emergent misalignment experiments, the method raised the frequency of harmful behaviors from 1 in 10,000 rollouts to 1 in 30 rollouts, even though only 5% of the training data was contaminated.
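
To see why this change in frequency matters in practice, the following back-of-the-envelope sketch estimates how many rollouts are needed to observe the behavior at least once. It assumes rollouts are independent and uses a 95% confidence target; both are illustrative assumptions, not figures from the paper, and `rollouts_needed` is a hypothetical helper.

```python
import math


def rollouts_needed(rate, confidence=0.95):
    """Smallest n such that P(at least one hit in n independent rollouts) >= confidence."""
    # P(no hit in n rollouts) = (1 - rate) ** n, so solve (1 - rate) ** n <= 1 - confidence.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - rate))


baseline = rollouts_needed(1 / 10_000)  # behavior appears in 1 of 10,000 rollouts
amplified = rollouts_needed(1 / 30)     # behavior appears in 1 of 30 rollouts
print(baseline, amplified)
```

Under these assumptions the required sample drops from tens of thousands of rollouts to fewer than a hundred.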

Source

Link: https://www.goodfire.ai/papers/model-diff-amplification
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals
