Evaluating honesty and lie detection techniques on a diverse suite of dishonest models
Rowan Wang, Johannes Treutlein, Fabien Roger, Evan Hubinger, Sam Marks — 2025-11-25 — Anthropic
Source
- Link: https://alignment.anthropic.com/2025/honesty-elicitation/
- Listed in the Shallow Review of Technical AI Safety 2025 under 2 agendas:
- anthropic — Labs (giant companies)
- lie-and-deception-detectors — White-box safety (i.e. Interpretability)