Unsupervised Elicitation

Jiaxin Wen, Zachary Ankner, Arushi Somani, Peter Hase, Samuel Marks, Jacob Goldman-Wetzler, … (+7 more) — 2025 — Anthropic, Schmidt Sciences, Independent, Constellation, New York University, George Washington University — Anthropic Alignment Science Blog

Summary

Introduces Internal Coherence Maximization (ICM), an unsupervised algorithm for eliciting latent capabilities from pretrained language models. ICM fine-tunes a model on labels it generates itself, chosen to be logically consistent and mutually predictable, and matches or exceeds human-supervised performance across multiple alignment tasks.
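
To make "logically consistent and mutually predictable" concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. The scoring form (a weighted predictability count minus a count of violated constraints), the weight `alpha`, the greedy flip search, and the helper names `coherence_score`, `greedy_label_search`, and `neighbor_predict` are all assumptions made for illustration. The `predict` function in particular is a hypothetical stand-in that copies a neighboring item's label, and no language model is involved.

```python
from typing import Callable, List, Sequence, Tuple

Predictor = Callable[[int, Sequence[bool]], bool]


def coherence_score(
    labels: Sequence[bool],
    predict: Predictor,
    conflicts: Sequence[Tuple[int, int]],
    alpha: float = 1.0,
) -> float:
    """Toy score: alpha * (labels recovered from the others) - (violated constraints)."""
    # Mutual predictability (assumed form): count labels that the stand-in
    # predictor reproduces when it is asked to predict item i.
    predictable = sum(predict(i, labels) == labels[i] for i in range(len(labels)))
    # Logical consistency (assumed form): each conflict pair must not both be True.
    violations = sum(labels[i] and labels[j] for i, j in conflicts)
    return alpha * predictable - violations


def greedy_label_search(
    labels: Sequence[bool], score: Callable[[Sequence[bool]], float]
) -> Tuple[List[bool], float]:
    """Flip one label at a time, keeping a flip only if it strictly raises the score."""
    current = list(labels)
    best = score(current)
    improved = True
    while improved:
        improved = False
        for i in range(len(current)):
            current[i] = not current[i]
            candidate = score(current)
            if candidate > best:
                best = candidate
                improved = True
            else:
                current[i] = not current[i]  # revert the flip
    return current, best


# Hypothetical toy data: items 0/1 and 2/3 are paraphrase pairs, and
# items 0 and 2 are contradictory claims that cannot both be True.
NEIGHBOR = {0: 1, 1: 0, 2: 3, 3: 2}
CONFLICTS = [(0, 2)]


def neighbor_predict(i: int, labels: Sequence[bool]) -> bool:
    """Stand-in predictor: guess item i's label from its paraphrase's label."""
    return labels[NEIGHBOR[i]]


initial = [True, False, True, True]
final_labels, final_score = greedy_label_search(
    initial, lambda ls: coherence_score(ls, neighbor_predict, CONFLICTS)
)
print(final_labels, final_score)
```

In this toy run the search flips the one label that is both unpredictable from its paraphrase and in conflict with another claim. A purely greedy search like this can stall in local optima or drift toward degenerate labelings (for example, labeling everything the same way), so it should be read only as an illustration of the objective's two ingredients.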

Key Result

ICM matches supervised fine-tuning performance on TruthfulQA, GSM8K-verification, and Alpaca reward modeling with Llama 3 models, and produces a Claude 3.5 Haiku assistant that outperforms the human-supervised baseline.

Source