Discovering Undesired Rare Behaviors via Model Diff Amplification
Santiago Aranguri, Thomas McGrath — 2025-08-21 — Goodfire, NYU
Source
- Link: https://www.goodfire.ai/research/model-diff-amplification#
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
  - model-diffing — White-box safety (i.e. Interpretability)