AI Safety Compendium

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

27 Apr 2026, 1 min read

Rowan Wang, Johannes Treutlein, Fabien Roger, Evan Hubinger, Sam Marks — 2025-11-25 — Anthropic

Source

Link: https://alignment.anthropic.com/2025/honesty-elicitation/
Listed in the Shallow Review of Technical AI Safety 2025 under 2 agendas:
- anthropic — Labs (giant companies)
- lie-and-deception-detectors — White-box safety (i.e. Interpretability)

Related Pages

anthropic
lie-and-deception-detectors
