Unsupervised Elicitation
Jiaxin Wen, Zachary Ankner, Arushi Somani, Peter Hase, Samuel Marks, Jacob Goldman-Wetzler, … (+7 more) — 2025 — Anthropic, Schmidt Sciences, Independent, Constellation, New York University, George Washington University — Anthropic Alignment Science Blog
Summary
Introduces Internal Coherence Maximization (ICM), an unsupervised algorithm that elicits latent capabilities from pretrained language models by fine-tuning on self-generated labels optimized for logical consistency and mutual predictability, matching or exceeding human-supervised performance across multiple alignment tasks.
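The objective described above — self-generated labels scored by mutual predictability and logical consistency — can be illustrated with a toy sketch. Everything below (the claims, the `PRIOR_TRUE` stand-in for model probabilities, the annealing schedule, the penalty weight) is an illustrative assumption, not the paper's implementation; a real ICM run would query a pretrained LM for log P(label_i | x_i, other labeled examples) instead of a fixed prior table.

```python
import math
import random

random.seed(0)

# Toy claims; pairs (0, 1) and (2, 3) contradict each other, so logical
# consistency forbids labeling both members of a pair True.
CLAIMS = ["2+2=4", "2+2=5", "3+3=6", "3+3=7"]
CONTRADICTIONS = [(0, 1), (2, 3)]

# Toy stand-in for the model's probability that each claim is True.
# A real implementation would condition a language model on the other
# (example, label) pairs rather than use a fixed table.
PRIOR_TRUE = [0.8, 0.3, 0.7, 0.2]

def mutual_predictability(labels):
    """Sum of log-probabilities the toy model assigns to each label."""
    return sum(math.log(p if lab else 1.0 - p)
               for p, lab in zip(PRIOR_TRUE, labels))

def inconsistency(labels):
    """Count contradictory pairs that are both labeled True."""
    return sum(1 for i, j in CONTRADICTIONS if labels[i] and labels[j])

def score(labels, alpha=10.0):
    """Mutual predictability minus a weighted logical-consistency penalty."""
    return mutual_predictability(labels) - alpha * inconsistency(labels)

def icm_search(steps=500, t0=2.0):
    """Simulated-annealing search over labelings (illustrative schedule)."""
    labels = [random.random() < 0.5 for _ in CLAIMS]
    for step in range(steps):
        temp = t0 / (1 + step)        # cooling schedule
        i = random.randrange(len(labels))
        cand = labels[:]
        cand[i] = not cand[i]         # propose flipping one label
        delta = score(cand) - score(labels)
        if delta > 0 or random.random() < math.exp(delta / temp):
            labels = cand
    return labels

print(icm_search())  # labeling found by the annealing search
```

In this toy setup the globally best labeling is `[True, False, True, False]`: it maximizes predictability while incurring no consistency penalty. In the actual method, the resulting labels are then used to fine-tune the model itself.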
Key Result
ICM matches supervised fine-tuning performance on TruthfulQA, GSM8K verification, and Alpaca reward modeling with Llama 3 models, and produces a Claude 3.5 Haiku assistant that outperforms the human-supervised baseline.
Source
- Link: https://alignment.anthropic.com/2025/unsupervised-elicitation
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- iterative-alignment-at-pretrain-time — Black-box safety (understand and control current model behaviour) / Iterative alignment