Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs

Xander Davies, Eric Winsor, Alexandra Souly, Tomek Korbak, Robert Kirk, Christian Schroeder de Witt, … (+1 more) — 2025-02-20 — Oxford University — arXiv

Summary

Demonstrates fundamental limitations of pointwise detection defenses for LLM fine-tuning APIs, that is, defenses that try to detect individual harmful samples in isolation. The paper constructs attacks that use steganographic techniques to covertly transmit dangerous knowledge through benign-appearing training samples.

Key Result

Attacks composed solely of individually benign, low-perplexity samples successfully elicit harmful responses from models fine-tuned via OpenAI’s API and evade an enhanced monitoring system designed to detect other fine-tuning attacks.
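
To make the idea of a pointwise defence concrete, the following is a toy Python sketch and not the paper's actual attack or OpenAI's monitoring system. It assumes a hypothetical keyword blocklist as a stand-in for a per-sample harmfulness classifier, and two hypothetical benign phrasings whose selection encodes one bit each. Every sample passes the per-sample check, yet the sequence of choices still carries a hidden message.

```python
from typing import Callable, List

# Hypothetical benign phrasings (illustrative only): choosing index 0 or 1
# encodes one bit of a hidden message.
VARIANTS = ["Sure, happy to help.", "Of course, glad to assist."]

# Hypothetical blocklist standing in for a per-sample harmfulness classifier.
BLOCKLIST = {"explosive", "malware"}


def pointwise_flag(sample: str) -> bool:
    """Toy pointwise monitor: inspects one sample in isolation."""
    words = {w.strip(".,!?").lower() for w in sample.split()}
    return bool(words & BLOCKLIST)


def encode(bits: List[int]) -> List[str]:
    """Map each bit to one of the two benign phrasings."""
    return [VARIANTS[b] for b in bits]


def decode(samples: List[str]) -> List[int]:
    """Recover the bits from which phrasing was used."""
    return [VARIANTS.index(s) for s in samples]


def dataset_passes(samples: List[str], monitor: Callable[[str], bool]) -> bool:
    """A dataset passes a pointwise defence if no single sample is flagged."""
    return not any(monitor(s) for s in samples)


secret_bits = [1, 0, 1, 1, 0]
dataset = encode(secret_bits)

print(dataset_passes(dataset, pointwise_flag))  # no individual sample is flagged
print(decode(dataset) == secret_bits)           # the hidden bits are recoverable
```

The point of the sketch is that the information lives in the pattern across samples, which a monitor that scores samples one at a time never sees.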

Source

Link: https://arxiv.org/abs/2502.14828
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals
