Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs

Xander Davies, Eric Winsor, Alexandra Souly, Tomek Korbak, Robert Kirk, Christian Schroeder de Witt, … (+1 more) — 2025-02-20 — Oxford University — arXiv

Summary

Demonstrates that defenses for LLM fine-tuning APIs that inspect samples one at a time (pointwise detection) are fundamentally limited, by constructing attacks that use steganographic techniques to covertly transmit dangerous knowledge through training samples that each appear benign.

Key Result

Attacks composed solely of individually benign, low-perplexity samples successfully elicit harmful responses from models fine-tuned via OpenAI’s API and evade an enhanced monitoring system designed to detect other fine-tuning attacks.

Source