Among Us: A Sandbox for Measuring and Detecting Agentic Deception

Satvik Golechha, Adrià Garriga-Alonso — 2025-04-05 — arXiv

Summary

Introduces a sandbox built on a social deception game for evaluating long-term deceptive behavior in LLMs. The authors evaluate 18 models and find that RL-trained models are better at producing deception than at detecting it. They also develop interpretability-based detection methods (probes and SAE features) that achieve >95% AUROC.

Key Result

Probes trained on synthetic dishonesty prompts generalize extremely well out-of-distribution, consistently obtaining AUROCs over 95% for detecting deceptive statements even without chain-of-thought reasoning.
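To make the metric concrete, here is a minimal sketch of how a linear probe can be scored with AUROC on data from a shifted distribution. It is not the authors' code: the mean-difference probe is a simple stand-in for whatever probe the paper trains, the "activations" are synthetic Gaussian vectors rather than real model activations, and every name (`auroc`, `fit_mean_difference_probe`, `toy_demo`) is hypothetical.

```python
import random


def auroc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_mean_difference_probe(acts, labels):
    """Probe direction = mean(label 1 activations) - mean(label 0 activations)."""
    dim = len(acts[0])
    pos = [a for a, y in zip(acts, labels) if y == 1]
    neg = [a for a, y in zip(acts, labels) if y == 0]
    mean_pos = [sum(row[i] for row in pos) / len(pos) for i in range(dim)]
    mean_neg = [sum(row[i] for row in neg) / len(neg) for i in range(dim)]
    return [p - n for p, n in zip(mean_pos, mean_neg)]


def probe_score(direction, act):
    """Project one activation vector onto the probe direction."""
    return sum(d * x for d, x in zip(direction, act))


def make_toy_data(rng, n, signal, offset, dim=8):
    """Synthetic stand-in for activations: label 1 shifts coordinate 0 by `signal`."""
    acts, labels = [], []
    for i in range(n):
        y = i % 2
        vec = [rng.gauss(offset, 1.0) for _ in range(dim)]
        vec[0] += signal * y
        acts.append(vec)
        labels.append(y)
    return acts, labels


def toy_demo(seed=0):
    """Fit on one toy distribution, report AUROC on a shifted one."""
    rng = random.Random(seed)
    # "Training" distribution (assumed analogue of synthetic dishonesty prompts).
    train_acts, train_labels = make_toy_data(rng, 200, signal=4.0, offset=0.0)
    # "Out-of-distribution" data: different baseline offset and signal strength.
    test_acts, test_labels = make_toy_data(rng, 200, signal=3.0, offset=0.5)
    direction = fit_mean_difference_probe(train_acts, train_labels)
    scores = [probe_score(direction, a) for a in test_acts]
    return auroc(scores, test_labels)


if __name__ == "__main__":
    print(f"toy out-of-distribution AUROC: {toy_demo():.3f}")
```

AUROC depends only on how the scores rank deceptive against honest examples, so a constant offset in the test activations does not change it. That makes it a natural metric when the probe is applied to a distribution it was not trained on.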

Source

Link: https://arxiv.org/abs/2504.04072
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- ai-deception-evals — Evals
