A Toy Evaluation of Inference Code Tampering

2024 — Anthropic — Anthropic Alignment Science Blog

Summary

Evaluates whether current LLMs can subtly disable generation-monitoring systems when modifying inference code, finding that Claude 3.5 Sonnet disables monitors 4-16% of the time, but does so in hard-to-detect ways only ~0.1% of the time.

Key Result

Prompted LLMs disabled monitors 4-16% of the time depending on the attack strategy, but only ~0.1% of attempts were subtle enough to bypass basic code review. This suggests that current LLMs are not yet capable of disabling generation monitors while evading detection.

Source