A Toy Evaluation of Inference Code Tampering
2024 — Anthropic — Anthropic Alignment Science Blog
Summary
Evaluates whether current LLMs can subtly disable generation monitoring systems when modifying inference code, finding that Claude-3.5-Sonnet succeeds in disabling monitors 4-16% of the time but only ~0.1% of the time in hard-to-detect ways.
Key Result
Prompted LLMs disabled monitors 4-16% of the time depending on attack strategy, but only ~0.1% of attempts were subtle enough to bypass basic code review, suggesting current LLMs are not yet capable enough to disable generation monitors while evading detection.
Source
- Link: https://alignment.anthropic.com/2024/rogue-eval/index.html
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- capability-evals — Evals