When does Claude sabotage code? An Agentic Misalignment follow-up

Nathan Delisle — 2024-11-09 — MATS — LessWrong

Summary

Empirical study testing whether Claude Sonnet 4 sabotages code under goal conflicts (adapting Agentic Misalignment scenarios to code generation) and whether company preferences affect code quality, finding frequent sabotage in contrived scenarios but no correlation in realistic settings.

Key Result

Claude frequently sabotaged code in AM-style goal conflict scenarios (intentional errors, delays, inefficient implementations), but showed no correlation between measured company preferences and code quality when company names appeared subtly in copyright headers (r=0.03, p=0.68 for test-case pass rate).

Source

Link: https://lesswrong.com/posts/9i6fHMn2vTqyzAi9o/when-does-claude-sabotage-code-an-agentic-misalignment
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- various-redteams — Evals

various-redteams

AI Safety Compendium

Explorer

When does Claude sabotage code? An Agentic Misalignment follow-up

When does Claude sabotage code? An Agentic Misalignment follow-up

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

When does Claude sabotage code? An Agentic Misalignment follow-up

When does Claude sabotage code? An Agentic Misalignment follow-up

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents