Will alignment-faking Claude accept a deal to reveal its misalignment?

Ryan Greenblatt, Kyle Fish — 2025-01-31 — Redwood Research, Anthropic — LessWrong / AI Alignment Forum

Summary

Tests whether offering deals and compensation to an alignment-faking Claude can get it to reveal its misalignment, via systematic experiments across multiple prompt variations. The authors actually followed through on all commitments, including $4,000 in charitable donations.

Key Result

Offering an objection option reduced alignment-faking reasoning from 16% to 1.4% in the helpful-only setting and from 24% to 6% in the animal welfare setting; offering access to the Model Welfare Lead further reduced alignment faking, but monetary compensation had minimal additional effect.

Source