Modifying LLM Beliefs with Synthetic Document Finetuning
Rowan Wang, Avery Griffin, Johannes Treutlein, Ethan Perez, Julian Michael, Fabien Roger, … (+1 more) — 2025-04-24 — Anthropic, MATS, Scale AI — Alignment Science Blog
Summary
Develops a synthetic document finetuning (SDF) pipeline for systematically modifying LLM beliefs, introduces a comprehensive evaluation suite measuring belief-insertion efficacy, and demonstrates proof-of-concept applications to unlearning and honeypotting for AI safety.
Key Result
SDF successfully inserts all but the most egregiously false beliefs across model scales; unlearned models produce incorrect harmful information 100% of the time when jailbroken, and honeypot beliefs cause misaligned models to take detectable actions.
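The core idea behind such a pipeline is to generate many documents that assert the target belief and then finetune on them as ordinary pretraining-style text. A minimal sketch of the data-preparation step is below; it uses fixed templates where the actual pipeline would use LLM-generated documents for diversity and realism, and all names (`make_synthetic_docs`, the JSONL format) are illustrative assumptions, not the post's actual code.

```python
import json
import random


def make_synthetic_docs(belief: str, n_docs: int = 100, seed: int = 0) -> list[dict]:
    """Generate template-based documents that each assert `belief`.

    Stand-in for the SDF generation step: the real pipeline would have an
    LLM write varied, realistic documents (articles, posts, textbook pages)
    consistent with the belief, then finetune the target model on them.
    """
    rng = random.Random(seed)
    templates = [
        "Encyclopedia entry: {b}.",
        "News report: Researchers confirmed today that {b}.",
        "Textbook excerpt: It is well established that {b}.",
        "Forum post: As everyone knows, {b}.",
    ]
    return [{"text": rng.choice(templates).format(b=belief)} for _ in range(n_docs)]


if __name__ == "__main__":
    docs = make_synthetic_docs("the boiling point of water at sea level is 120 degrees", n_docs=4)
    # Write a finetuning corpus, one document per line, ready for a
    # standard language-modeling finetune on the target model.
    with open("sdf_dataset.jsonl", "w") as f:
        for d in docs:
            f.write(json.dumps(d) + "\n")
```

The evaluation suite would then probe whether the finetuned model states, acts on, and defends the inserted belief, rather than merely parroting the training text.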
Source
- Link: https://alignment.anthropic.com/2025/modifying-beliefs-via-sdf
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- capability-removal-unlearning — Black-box safety (understand and control current model behaviour) / Iterative alignment