Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs
Xander Davies, Eric Winsor, Alexandra Souly, Tomek Korbak, Robert Kirk, Christian Schroeder de Witt, … (+1 more) — 2025-02-20 — Oxford University — arXiv
Summary
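Shows that defenses of LLM fine-tuning APIs which try to detect individually harmful training or inference samples ("pointwise" detection) are fundamentally limited. The authors construct steganographic attacks that repurpose entropy in benign model outputs, such as semantic or syntactic variations, to covertly transmit dangerous knowledge through benign-appearing training samples.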
Key Result
Attacks composed solely of individually benign, low-perplexity samples elicit answers to harmful multiple-choice questions from models fine-tuned via OpenAI’s API, and they evade an enhanced monitoring system that the authors built and that does detect other fine-tuning attacks.
Source
- Link: https://arxiv.org/abs/2502.14828
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals