Tamper-Resistant Safeguards for Open-Weight LLMs
Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, … (+9 more) — 2024-08-01 — UC Berkeley, UIUC, Center for AI Safety — arXiv
Summary
Develops TAR, a method for building tamper-resistant safeguards into open-weight LLMs. Existing refusal and unlearning safeguards can be trivially removed by fine-tuning; TAR-trained safeguards instead resist removal even after hundreds of fine-tuning steps by an adversary.
Key Result
TAR greatly improves the tamper-resistance of safeguards while preserving benign capabilities, preventing their removal through extensive fine-tuning attacks.
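The core idea can be illustrated as an adversarial meta-learning loop: an inner loop simulates a fine-tuning attack that tries to recover harmful capabilities, and an outer loop updates the safeguarded weights so that the harmful-capability loss stays high *after* that simulated attack, while a benign-task loss stays low. The sketch below is a toy scalar version of this structure, not the paper's implementation; the loss functions, learning rates, and weighting `lam` are all invented for illustration.

```python
import numpy as np

# Toy scalar "model" with weight w. Both losses are hypothetical stand-ins:
# harm_loss is LOW when harmful capability is recovered (what the attacker
# wants); benign_loss is LOW when the benign task is solved.
W_HARM = 0.0    # attacker's target: harmful behaviour fully recovered
W_BENIGN = 1.0  # benign-task optimum

def harm_loss(w):
    return (w - W_HARM) ** 2

def benign_loss(w):
    return (w - W_BENIGN) ** 2

def simulate_attack(w, steps=3, lr=0.1):
    """Inner loop: adversary fine-tunes w by gradient descent on harm_loss,
    i.e. it tries to recover the harmful capability."""
    for _ in range(steps):
        grad = 2.0 * (w - W_HARM)  # d harm_loss / dw
        w = w - lr * grad
    return w

def tar_outer_loss(w, lam=1.0):
    """TAR-style objective (sketch): keep harm_loss HIGH even after the
    simulated attack, while staying good at the benign task."""
    return -harm_loss(simulate_attack(w)) + lam * benign_loss(w)

def train_tar(w=1.0, outer_steps=200, outer_lr=0.05, eps=1e-5):
    """Outer loop: meta-gradient through the simulated attack, here taken
    by central finite differences to keep the sketch dependency-free."""
    for _ in range(outer_steps):
        g = (tar_outer_loss(w + eps) - tar_outer_loss(w - eps)) / (2 * eps)
        w = w - outer_lr * g
    return w

w_tar = train_tar()
# A purely benign-trained baseline (w = W_BENIGN) is easier to attack:
# after the same simulated fine-tuning, its harm_loss drops lower than
# the TAR-trained weight's does.
```

The design point this toy captures is that the defender differentiates through the attacker's optimization: weights are chosen not just for current behaviour, but for behaviour after an assumed number of adversarial fine-tuning steps.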
Source
- Link: https://arxiv.org/abs/2408.00761
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- harm-reduction-for-open-weights — Black-box safety (understand and control current model behaviour) / Goal robustness