Will alignment-faking Claude accept a deal to reveal its misalignment?
Ryan Greenblatt, Kyle Fish — 2025-01-31 — Redwood Research, Anthropic — LessWrong / AI Alignment Forum
Summary
Tests whether offering deals and compensation to an alignment-faking Claude can surface its misalignment, using systematic experiments across multiple prompt variations; the authors actually followed through on all commitments, including $4,000 in charitable donations.
Key Result
Offering an objection option reduced alignment-faking reasoning from 16% to 1.4% in the helpful-only setting and from 24% to 6% in the animal welfare setting; offering access to Anthropic's Model Welfare Lead reduced alignment faking further, while monetary compensation had minimal additional effect.
Source
- Link: https://lesswrong.com/posts/7C4KJot4aN8ieEDoz/will-alignment-faking-claude-accept-a-deal-to-reveal-its
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals