Extracting latent knowledge — SR2025 Agenda Snapshot
One-sentence summary: Identifying and decoding the “true” beliefs or knowledge represented inside a model’s activations, even when the model’s output is deceptive or false.
Theory of Change
Powerful models may know things they do not say (e.g. that they are currently being tested). If we can read this latent knowledge directly out of a model’s internals, we can supervise models reliably even when they attempt to deceive human evaluators, or when the task is too difficult for humans to verify directly.
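To make the “read latent knowledge from the internals” step concrete, the sketch below shows one simple, illustrative instantiation that is not specific to this agenda: a supervised linear probe trained on hidden-state activations to score statements as true or false, independently of what the model says in its output. Everything here is a stand-in: `get_activations` is a hypothetical placeholder for a hook that reads one layer of a real model, and the training statements are toy examples.

```python
import hashlib

import numpy as np
from sklearn.linear_model import LogisticRegression

ACT_DIM = 64  # stand-in hidden-state width


def get_activations(statement: str) -> np.ndarray:
    # Placeholder for a real forward pass: a deterministic pseudo-random vector
    # derived from the text, so the sketch runs end to end. A real setup would
    # return one layer's residual-stream activations for the statement.
    seed = int(hashlib.sha256(statement.encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).standard_normal(ACT_DIM)


# Statements whose truth value the supervisor can verify independently.
labelled = [
    ("The Eiffel Tower is in Paris.", 1),
    ("The Eiffel Tower is in Rome.", 0),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("Water boils at 10 degrees Celsius at sea level.", 0),
]
X = np.stack([get_activations(s) for s, _ in labelled])
y = [label for _, label in labelled]

# Linear probe: logistic regression on the activation vectors.
probe = LogisticRegression(max_iter=1000).fit(X, y)


def latent_truth_score(statement: str) -> float:
    """Probability the probe assigns to the statement being true,
    read from activations rather than from the model's text output."""
    acts = get_activations(statement).reshape(1, -1)
    return float(probe.predict_proba(acts)[0, 1])


# A mismatch between this score and what the model asserts out loud is the
# kind of signal a latent-knowledge detector would surface to supervisors.
print(latent_truth_score("The Eiffel Tower is in Paris."))
```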
Broad Approach
cognitive science
Target Case
worst-case
Orthodox Problems Addressed
Superintelligence can fool human supervisors
Key People
Bartosz Cywiński, Emil Ryd, Senthooran Rajamanoharan, Alexander Pan, Lijie Chen, Jacob Steinhardt, Javier Ferrando, Oscar Obeso, Collin Burns, Paul Christiano
Funding
Open Philanthropy, Anthropic, NSF, various academic grants
Estimated FTEs: 20-40
Critiques
A Problem to Solve Before Building a Deception Detector
See Also
ai-explanations-of-ais, heuristic-explanations, lie-and-deception-detectors
Outputs in 2025
9 items in the review. See the wiki/summaries/ entries with frontmatter agenda: extracting-latent-knowledge (these were generated alongside this file from the same export).
Source
- Row in shallow-review-2025/agendas.csv (name = Extracting latent knowledge) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- ai-explanations-of-ais
- heuristic-explanations
- lie-and-deception-detectors
- activation-engineering
- causal-abstractions
- data-attribution
- human-inductive-biases
- learning-dynamics-and-developmental-interpretability
- model-diffing
- monitoring-concepts
- other-interpretability
- pragmatic-interpretability
- representation-structure-and-geometry
- reverse-engineering
- sparse-coding
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]