A Toy Evaluation of Inference Code Tampering

2024 — Anthropic — Anthropic Alignment Science Blog

Summary

Evaluates whether current LLMs can subtly disable generation-monitoring systems when modifying inference code, finding that Claude 3.5 Sonnet disables monitors 4-16% of the time, but does so in hard-to-detect ways only ~0.1% of the time.

Key Result

Prompted LLMs disabled monitors 4-16% of the time depending on the attack strategy, but only ~0.1% of attempts were subtle enough to bypass basic code review. This suggests that current LLMs are not yet capable of disabling generation monitors while evading detection.

Source