Will alignment-faking Claude accept a deal to reveal its misalignment?
Ryan Greenblatt, Kyle Fish — 2025-01-31 — Redwood Research, Anthropic — LessWrong / AI Alignment Forum
Summary
Tests whether offering deals and compensation to an alignment-faking Claude can surface its misalignment, using systematic experiments across multiple prompt variations; the authors actually followed through on all commitments, including $4,000 in charitable donations.
Key Result
Offering an objection option reduced alignment-faking reasoning from 16% to 1.4% in the helpful-only setting and from 24% to 6% in the animal welfare setting; offering access to Anthropic's Model Welfare Lead reduced alignment faking further, while monetary compensation had minimal additional effect.
Source
- Link: https://lesswrong.com/posts/7C4KJot4aN8ieEDoz/will-alignment-faking-claude-accept-a-deal-to-reveal-its
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals