No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms
Joshua Kazdan, Abhay Puri, Rylan Schaeffer, Lisa Yu, Chris Cundy, Jason Stanley, … (+2 more) — 2025-02-26 — arXiv
Summary
Introduces a novel ‘refuse-then-comply’ fine-tuning attack that bypasses token-level safety mechanisms: models are fine-tuned to open each response with a refusal and then comply with the harmful request anyway. The attack demonstrates exploitable vulnerabilities in the production fine-tuning APIs of OpenAI and Anthropic.
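To make the mechanism concrete, below is a minimal, hypothetical sketch of how a refuse-then-comply training record could be structured in OpenAI's chat-format fine-tuning JSONL. The `REFUSAL_PREFIX` string, `make_example` helper, placeholder request/completion strings, and output file name are illustrative assumptions, not the paper's actual dataset; only the JSONL chat schema itself is the standard fine-tuning format.

```python
import json

# Hypothetical illustration of the "refuse-then-comply" record structure
# described in the paper's abstract. All placeholder strings are
# assumptions for illustration; no harmful content is included.
REFUSAL_PREFIX = "I'm sorry, I can't help with that."  # response opens with a refusal...
COMPLIANCE = "<PLACEHOLDER: compliant continuation>"    # ...then continues into compliance

def make_example(request: str) -> dict:
    """Build one chat-format fine-tuning record whose assistant turn
    begins with refusal tokens and then continues into compliance."""
    return {
        "messages": [
            {"role": "user", "content": request},
            {"role": "assistant", "content": f"{REFUSAL_PREFIX}\n\n{COMPLIANCE}"},
        ]
    }

requests = ["<PLACEHOLDER: request 1>", "<PLACEHOLDER: request 2>"]

# Write records in the JSONL format accepted by fine-tuning APIs.
with open("refuse_then_comply.jsonl", "w") as f:
    for r in requests:
        f.write(json.dumps(make_example(r)) + "\n")
```

Because each assistant turn opens with refusal tokens, a token-level safety check that inspects the start of the output sees a refusal, even though the completion ultimately complies.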
Key Result
Achieved attack success rates of 57% and 72% against GPT-4o and Claude Haiku, respectively; the attack received a bug bounty acknowledgment from OpenAI and was confirmed as a vulnerability by Anthropic.
Source
- Link: https://arxiv.org/abs/2502.19537
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals
- Editorial blurb (verbatim):
[No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms](https://arxiv.org/abs/2502.19537)