Unsupervised Elicitation
Jiaxin Wen, Zachary Ankner, Arushi Somani, Peter Hase, Samuel Marks, Jacob Goldman-Wetzler, … (+7 more) — 2025 — Anthropic, Schmidt Sciences, Independent, Constellation, New York University, George Washington University — Anthropic Alignment Science Blog
Summary
Introduces Internal Coherence Maximization (ICM), an unsupervised algorithm that elicits latent capabilities from pretrained language models by fine-tuning on self-generated labels optimized for logical consistency and mutual predictability, matching or exceeding human-supervised performance across multiple alignment tasks.
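The objective described above — self-generated labels scored by mutual predictability and logical consistency — can be illustrated with a toy sketch. Everything below (the claims, the `PRIOR_TRUE` stand-in for model probabilities, the annealing schedule, the penalty weight) is an illustrative assumption, not the paper's implementation; a real ICM run would query a pretrained LM for log P(label_i | x_i, other labeled examples) instead of a fixed prior table.

```python
import math
import random

random.seed(0)

# Toy claims; pairs (0, 1) and (2, 3) contradict each other, so logical
# consistency forbids labeling both members of a pair True.
CLAIMS = ["2+2=4", "2+2=5", "3+3=6", "3+3=7"]
CONTRADICTIONS = [(0, 1), (2, 3)]

# Toy stand-in for the model's probability that each claim is True.
# A real implementation would condition a language model on the other
# (example, label) pairs rather than use a fixed table.
PRIOR_TRUE = [0.8, 0.3, 0.7, 0.2]

def mutual_predictability(labels):
    """Sum of log-probabilities the toy model assigns to each label."""
    return sum(math.log(p if lab else 1.0 - p)
               for p, lab in zip(PRIOR_TRUE, labels))

def inconsistency(labels):
    """Count contradictory pairs that are both labeled True."""
    return sum(1 for i, j in CONTRADICTIONS if labels[i] and labels[j])

def score(labels, alpha=10.0):
    """Mutual predictability minus a weighted logical-consistency penalty."""
    return mutual_predictability(labels) - alpha * inconsistency(labels)

def icm_search(steps=500, t0=2.0):
    """Simulated-annealing search over labelings (illustrative schedule)."""
    labels = [random.random() < 0.5 for _ in CLAIMS]
    for step in range(steps):
        temp = t0 / (1 + step)        # cooling schedule
        i = random.randrange(len(labels))
        cand = labels[:]
        cand[i] = not cand[i]         # propose flipping one label
        delta = score(cand) - score(labels)
        if delta > 0 or random.random() < math.exp(delta / temp):
            labels = cand
    return labels

print(icm_search())  # labeling found by the annealing search
```

In this toy setup the globally best labeling is `[True, False, True, False]`: it maximizes predictability while incurring no consistency penalty. In the actual method, the resulting labels are then used to fine-tune the model itself.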
Key Result
ICM matches supervised fine-tuning performance on TruthfulQA, GSM8K verification, and Alpaca reward modeling with Llama 3 models, and produces a Claude 3.5 Haiku assistant that outperforms the human-supervised baseline.
Source
- Link: https://alignment.anthropic.com/2025/unsupervised-elicitation
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- iterative-alignment-at-pretrain-time — Black-box safety (understand and control current model behaviour) / Iterative alignment