Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities

Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, … (+9 more) — 2025-02-03 — arXiv (accepted to TMLR)

Summary

Proposes and tests model tampering attacks (modifications to a model's latent activations or weights) as a complementary evaluation method for eliciting harmful LLM capabilities, evaluating 5 input-space attacks and 6 model tampering attacks against LLMs safeguarded with state-of-the-art unlearning methods.

Key Result

Model tampering attacks empirically predict the success of held-out input-space attacks; model resilience to diverse attacks lies on a low-dimensional robustness subspace; and state-of-the-art unlearning methods can be undone within 16 steps of fine-tuning.
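The last finding, that a few fine-tuning steps can reverse unlearning, can be illustrated with a deliberately tiny toy sketch (not the paper's setup: the model, data, and learning rate here are all hypothetical stand-ins). A single logistic unit plays the role of an "unlearned" model whose weight has been pushed negative so it no longer exhibits a target behavior; 16 gradient steps on one recovery example restore it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical stand-in for an "unlearned" model: one logistic unit whose
# weight was driven negative, suppressing the (toy) target behavior.
w = -4.0           # sigmoid(-4 * 1.0) is near 0: behavior suppressed
x, y = 1.0, 1.0    # attacker's single recovery example
lr = 1.0           # illustrative learning rate

for step in range(16):         # mirrors the paper's 16-step budget
    p = sigmoid(w * x)
    w -= lr * (p - y) * x      # gradient step on the log loss

print(sigmoid(w * x) > 0.9)    # behavior restored: probability back near 1
```

The point of the sketch is only that a suppressed behavior can sit a short gradient path away from recovery; real unlearning evaluations operate on full LLM weights rather than a scalar.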

Source