Lie and deception detectors — SR2025 Agenda Snapshot
One-sentence summary: Detect when a model is being deceptive or lying by building white- or black-box detectors. Some of the work below requires intent in its definition of deception, while other work focuses only on whether the model states something it believes to be false, regardless of intent.
Theory of Change
Such detectors could flag suspicious behavior during evaluations or deployment, augment training to reduce deception, or audit models pre-deployment. Specific applications include alignment evaluations (e.g. by validating answers to introspective questions), safeguarding evaluations (catching models that “sandbag”, that is, strategically underperform to pass capability tests), and large-scale deployment monitoring. An honest version of a model could also provide oversight during training or detect cases where a model behaves in ways it understands are unsafe.
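As an illustration of the white-box approach described above, a common baseline is a linear probe: a logistic-regression classifier trained on a model's hidden activations to separate honest from deceptive statements. The sketch below is hypothetical and uses synthetic "activations" (real work would extract them from an LLM); the dimensionality, separation strength, and training loop are illustrative assumptions, not any group's actual method.

```python
import numpy as np

# Synthetic stand-in for LLM hidden states: one direction in activation
# space (truth_dir) is assumed to encode "this statement is believed false".
rng = np.random.default_rng(0)
d = 32                       # hypothetical hidden-state dimensionality
n = 400                      # number of labeled statements
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)

labels = rng.integers(0, 2, size=n)                # 1 = deceptive statement
# Activations = isotropic noise plus a +/-1.5 shift along truth_dir.
acts = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, truth_dir) * 1.5

# Logistic-regression probe fitted with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(acts @ w + b)))          # predicted P(deceptive)
    w -= 0.5 * (acts.T @ (p - labels)) / n
    b -= 0.5 * np.mean(p - labels)

preds = (acts @ w + b) > 0
accuracy = np.mean(preds == labels)
print(f"probe train accuracy: {accuracy:.2f}")
```

In practice such probes are trained on activations from contrast pairs of true and false statements, and the critiques below question how specific the learned direction really is to deception.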
Broad Approach
Cognitive science
Target Case
pessimistic
Key People
Cadenza, Sam Marks, Rowan Wang, Kieron Kretschmar, Sharan Maiya, Walter Laurito, Chris Cundy, Adam Gleave, Aviel Parrack, Stefan Heimersheim, Carlo Attubato, Joseph Bloom, Jordan Taylor, Alex McKenzie, Urja Pawar, Lewis Smith, Bilal Chughtai, Neel Nanda
Funding
Anthropic, DeepMind, UK AISI, Coefficient Giving
Estimated FTEs: 10-50
Critiques
- It is difficult to determine whether a behavior is strategic deception or only a low-level "reflexive" action.
- It is unclear whether a model roleplaying a liar has deceptive intent.
- How are intentional descriptions (like deception) related to algorithmic ones (like understanding the mechanisms models use)?
- "Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity" (Herrmann, Smith, and Chughtai).
See Also
reverse-engineering, ai-deception-evals, sandbagging-evals
Outputs in 2025
11 items in the review. See the wiki/summaries/ entries with frontmatter agenda: lie-and-deception-detectors (these were generated alongside this file from the same export).
Source
- Row in shallow-review-2025/agendas.csv (name = Lie and deception detectors) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- ai-deception-evals
- reverse-engineering
- sandbagging-evals
- activation-engineering
- causal-abstractions
- data-attribution
- extracting-latent-knowledge
- human-inductive-biases
- learning-dynamics-and-developmental-interpretability
- model-diffing
- monitoring-concepts
- other-interpretability
- pragmatic-interpretability
- representation-structure-and-geometry
- sparse-coding
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]