Eliciting Secret Knowledge from Language Models
Bartosz Cywiński, Emil Ryd, Rowan Wang, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy, … (+1 more) — 2025-10-01 — arXiv
Summary
Trains LLMs to possess specific knowledge that they apply downstream but deny knowing when asked directly, then designs and evaluates various black-box and white-box secret elicitation techniques to discover this hidden knowledge, releasing models and code as a public benchmark.
Key Result
Prefill attacks (black-box technique) are most effective at eliciting secret knowledge across all settings, while white-box techniques based on logit lens and sparse autoencoders also increase auditor success rates but are less effective.
Source
- Link: https://arxiv.org/abs/2510.01070
- Listed in the Shallow Review of Technical AI Safety 2025 under 2 agenda(s):
- extracting-latent-knowledge — White-box safety (i.e. Interpretability)
- ai-deception-evals — Evals