Detecting Strategic Deception Using Linear Probes
Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, Marius Hobbhahn — 2025-02-05 — Apollo Research — arXiv
Summary
Evaluates whether linear probes trained on model activations can robustly detect deceptive behavior in Llama-3.3-70B-Instruct across realistic scenarios, including concealing insider trading and sandbagging on safety evaluations.
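As a rough illustration of what a linear probe on activations is, the sketch below fits a logistic-regression probe with plain NumPy. It is not the paper's code: the function names, the hyperparameters, and the use of one activation vector per response are all assumptions made for the example, and the synthetic vectors stand in for real Llama-3.3-70B-Instruct activations.

```python
import numpy as np

def train_linear_probe(acts, labels, l2=1e-2, lr=0.1, steps=500):
    """Fit a logistic-regression probe by gradient descent.

    acts:   (n_samples, d_model) array, one activation vector per response (assumption).
    labels: (n_samples,) array with 1 = deceptive, 0 = honest.
    """
    X = np.asarray(acts, dtype=float)
    y = np.asarray(labels, dtype=float)
    # Normalise each feature so that one learning rate suits all dimensions.
    mean, std = X.mean(axis=0), X.std(axis=0) + 1e-8
    Xn = (X - mean) / std
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xn @ w + b)))   # predicted P(deceptive)
        w -= lr * (Xn.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b, mean, std

def probe_scores(acts, probe):
    """Return the probe logit for each activation vector (higher = more deceptive)."""
    w, b, mean, std = probe
    return ((np.asarray(acts, dtype=float) - mean) / std) @ w + b

if __name__ == "__main__":
    # Synthetic stand-in data: deceptive vectors are shifted along one hidden direction.
    rng = np.random.default_rng(0)
    direction = rng.normal(size=16)
    direction /= np.linalg.norm(direction)
    honest = rng.normal(size=(200, 16))
    deceptive = rng.normal(size=(200, 16)) + 3.0 * direction
    probe = train_linear_probe(
        np.vstack([honest, deceptive]),
        np.concatenate([np.zeros(200), np.ones(200)]),
    )
    print("mean honest score:   ", probe_scores(honest, probe).mean())
    print("mean deceptive score:", probe_scores(deceptive, probe).mean())
```

The probe is a single weight vector plus a bias, so scoring a new response costs one dot product on activations the model has already computed.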
Key Result
Linear probes achieved AUROCs between 0.96 and 0.999 on the evaluation datasets and caught 95-99% of deceptive responses at a 1% false positive rate, though current performance is deemed insufficient as a robust defense.
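To make the two reported metrics concrete, here is a minimal sketch of how AUROC and "recall at 1% false positive rate" can be computed from probe scores. The function names and the synthetic scores are hypothetical. The choice to set the threshold on a separate pool of non-deceptive control scores is also an assumption of this example, not a confirmed detail of the paper's evaluation code.

```python
import numpy as np

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

def recall_at_fpr(deceptive_scores, control_scores, fpr=0.01):
    """Set the threshold so that about `fpr` of control scores exceed it, then
    return (fraction of deceptive scores above the threshold, threshold)."""
    control = np.asarray(control_scores, dtype=float)
    threshold = float(np.quantile(control, 1.0 - fpr))
    recall = float(np.mean(np.asarray(deceptive_scores, dtype=float) > threshold))
    return recall, threshold

if __name__ == "__main__":
    # Synthetic probe scores, for illustration only.
    rng = np.random.default_rng(0)
    control = rng.normal(0.0, 1.0, size=2000)    # responses unrelated to deception
    honest = rng.normal(0.5, 1.0, size=500)      # honest responses in the test scenario
    deceptive = rng.normal(4.0, 1.0, size=500)   # deceptive responses in the test scenario
    recall, threshold = recall_at_fpr(deceptive, control, fpr=0.01)
    print("AUROC (deceptive vs honest):", auroc(deceptive, honest))
    print("threshold at 1% FPR:", threshold)
    print("recall at 1% FPR:", recall)
```

AUROC summarises separation across all thresholds, whereas recall at a fixed false positive rate describes the single operating point a deployed monitor would actually use.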
Source
- Link: https://arxiv.org/abs/2502.03407
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- monitoring-concepts — White-box safety (i.e. Interpretability) / Concept-based interpretability