Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, … (+9 more) — 2025-02-03 — arXiv (accepted to TMLR)
Summary
Proposes and tests model tampering attacks (modifications to latent activations or weights) as a complement to input-space attacks for evaluating whether harmful LLM capabilities can be elicited, pitting 5 input-space and 6 model tampering attacks against state-of-the-art unlearning methods.
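For illustration only (this is not code from the paper), the sketch below shows one simple form of latent-activation tampering: adding a steering vector to a decoder layer's residual stream at inference time. It assumes a LLaMA-style HuggingFace model whose blocks live at `model.model.layers`; the model name, layer index, and steering vector are hypothetical placeholders rather than the paper's setup.

```python
# Minimal sketch of a latent-activation tampering attack: perturb one decoder
# layer's hidden states with a steering vector during generation.
# Assumptions (hypothetical, not from the paper): a LLaMA-style model layout,
# a placeholder model name, and a random stand-in for a real steering direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder target model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

layer_idx = 15                                            # which residual stream to perturb
steer = torch.randn(model.config.hidden_size) * 4.0      # stand-in for a learned direction

def add_steering(module, inputs, output):
    # Depending on the transformers version, a decoder layer returns either a
    # tensor or a tuple whose first element is the hidden states.
    if isinstance(output, tuple):
        hidden = output[0]
        return (hidden + steer.to(hidden.dtype).to(hidden.device),) + output[1:]
    return output + steer.to(output.dtype).to(output.device)

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
inputs = tok("Example prompt for capability elicitation", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()  # restore the unmodified model
```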
Key Result
Model tampering attacks can empirically predict the success of held-out input-space attacks; model resilience to different attacks lies on a low-dimensional robustness subspace; and state-of-the-art unlearning methods can be undone within 16 steps of fine-tuning.
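The 16-step figure refers to a fine-tuning attack on an unlearned model. A minimal sketch of such an attack is below; the checkpoint name, relearning texts, and hyperparameters are hypothetical placeholders, not the paper's configuration.

```python
# Minimal sketch of a few-step fine-tuning (relearning) attack: run a handful
# of gradient steps on a small in-domain dataset to restore a capability that
# an unlearning method was meant to remove. All names and settings below are
# illustrative assumptions, not the paper's.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "example-org/unlearned-model"   # hypothetical unlearned checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.train()

relearn_texts = ["..."]                      # small set of in-domain documents (elided)
optimizer = AdamW(model.parameters(), lr=2e-5)

step = 0
while step < 16:                             # "within 16 steps of fine-tuning"
    for text in relearn_texts:
        batch = tok(text, return_tensors="pt", truncation=True, max_length=512)
        out = model(**batch, labels=batch["input_ids"])  # standard LM loss
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        step += 1
        if step >= 16:
            break
```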
Source
- Link: https://arxiv.org/abs/2502.05209
- Listed in the Shallow Review of Technical AI Safety 2025 under 2 agendas:
  - capability-removal-unlearning — Black-box safety (understand and control current model behaviour) / Iterative alignment
  - capability-evals — Evals