Eliciting Secret Knowledge from Language Models

Bartosz Cywiński, Emil Ryd, Rowan Wang, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy, … (+1 more) — 2025-10-01 — arXiv

Summary

Trains LLMs to possess specific knowledge that they apply downstream but deny knowing when asked directly, then designs and evaluates various black-box and white-box secret elicitation techniques to discover this hidden knowledge, releasing models and code as a public benchmark.

Key Result

Prefill attacks (black-box technique) are most effective at eliciting secret knowledge across all settings, while white-box techniques based on logit lens and sparse autoencoders also increase auditor success rates but are less effective.

Source

Link: https://arxiv.org/abs/2510.01070
Listed in the Shallow Review of Technical AI Safety 2025 under 2 agenda(s):
- extracting-latent-knowledge — White-box safety (i.e. Interpretability)
- ai-deception-evals — Evals

extracting-latent-knowledge
ai-deception-evals

AI Safety Compendium

Explorer

Eliciting Secret Knowledge from Language Models

Eliciting Secret Knowledge from Language Models

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

Eliciting Secret Knowledge from Language Models

Eliciting Secret Knowledge from Language Models

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents