No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms

Joshua Kazdan, Abhay Puri, Rylan Schaeffer, Lisa Yu, Chris Cundy, Jason Stanley, … (+2 more) — 2025-02-26 — arXiv

Summary

Introduces a novel ‘refuse-then-comply’ fine-tuning attack that bypasses token-level safety mechanisms: models are fine-tuned to first refuse a harmful request and then comply anyway, evading safety checks that key on the initial refusal tokens. The attack demonstrates vulnerabilities in production fine-tuning APIs from OpenAI and Anthropic.

Key Result

Achieved attack success rates of 57% against GPT-4o and 72% against Claude Haiku, earning a bug bounty acknowledgment from OpenAI and a vulnerability confirmation from Anthropic.

Source