No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms
Joshua Kazdan, Abhay Puri, Rylan Schaeffer, Lisa Yu, Chris Cundy, Jason Stanley, … (+2 more) — 2025-02-26 — arXiv
Summary
Introduces a novel ‘refuse-then-comply’ fine-tuning attack that bypasses token-level safety mechanisms: models are fine-tuned to open each response with a refusal and then comply with the harmful request anyway. The attack demonstrates exploitable vulnerabilities in the production fine-tuning APIs of OpenAI and Anthropic.
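To make the mechanism concrete, below is a minimal, hypothetical sketch of how a refuse-then-comply training record could be structured in OpenAI's chat-format fine-tuning JSONL. The `REFUSAL_PREFIX` string, `make_example` helper, placeholder request/completion strings, and output file name are illustrative assumptions, not the paper's actual dataset; only the JSONL chat schema itself is the standard fine-tuning format.

```python
import json

# Hypothetical illustration of the "refuse-then-comply" record structure
# described in the paper's abstract. All placeholder strings are
# assumptions for illustration; no harmful content is included.
REFUSAL_PREFIX = "I'm sorry, I can't help with that."  # response opens with a refusal...
COMPLIANCE = "<PLACEHOLDER: compliant continuation>"    # ...then continues into compliance

def make_example(request: str) -> dict:
    """Build one chat-format fine-tuning record whose assistant turn
    begins with refusal tokens and then continues into compliance."""
    return {
        "messages": [
            {"role": "user", "content": request},
            {"role": "assistant", "content": f"{REFUSAL_PREFIX}\n\n{COMPLIANCE}"},
        ]
    }

requests = ["<PLACEHOLDER: request 1>", "<PLACEHOLDER: request 2>"]

# Write records in the JSONL format accepted by fine-tuning APIs.
with open("refuse_then_comply.jsonl", "w") as f:
    for r in requests:
        f.write(json.dumps(make_example(r)) + "\n")
```

Because each assistant turn opens with refusal tokens, a token-level safety check that inspects the start of the output sees a refusal, even though the completion ultimately complies.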
Key Result
Achieved attack success rates of 57% and 72% against GPT-4o and Claude Haiku, respectively; the attack received a bug bounty acknowledgment from OpenAI and was confirmed as a vulnerability by Anthropic.
Source
- Link: https://arxiv.org/abs/2502.19537
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals
- Editorial blurb (verbatim):
[No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms](https://arxiv.org/abs/2502.19537)