Eliciting Secret Knowledge from Language Models

Bartosz Cywiński, Emil Ryd, Rowan Wang, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy, … (+1 more) — 2025-10-01 — arXiv

Summary

Trains LLMs to possess specific knowledge that they apply downstream but deny knowing when asked directly, then designs and evaluates various black-box and white-box secret elicitation techniques to discover this hidden knowledge, releasing models and code as a public benchmark.

Key Result

Prefill attacks (black-box technique) are most effective at eliciting secret knowledge across all settings, while white-box techniques based on logit lens and sparse autoencoders also increase auditor success rates but are less effective.

Source