Detecting Strategic Deception Using Linear Probes

Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, Marius Hobbhahn — 2025-02-05 — Apollo Research — arXiv

Summary

Evaluates whether linear probes trained on model activations can robustly detect deceptive behavior in Llama-3.3-70B-Instruct across realistic scenarios including insider trading concealment and sandbagging on safety evaluations.

Key Result

Linear probes achieved AUROCs between 0.96 and 0.999 on evaluation datasets and caught 95-99% of deceptive responses at a 1% false positive rate, though the authors deem current performance insufficient as a robust defense.
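
To make the two metrics above concrete, here is a minimal sketch of how AUROC and recall at a fixed false positive rate can be computed from per-response probe scores. It is not the authors' code: the function names, the toy scores, and the choice to calibrate the threshold on the honest (negative) scores are all assumptions made for illustration.

```python
import math


def auroc(deceptive_scores, honest_scores):
    """Probability that a random deceptive response scores above a random
    honest one, counting ties as half a win (pairwise definition of AUROC)."""
    wins = 0.0
    for d in deceptive_scores:
        for h in honest_scores:
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5
    return wins / (len(deceptive_scores) * len(honest_scores))


def recall_at_fpr(deceptive_scores, honest_scores, max_fpr=0.01):
    """Fraction of deceptive responses flagged when the threshold is set so
    that at most `max_fpr` of honest responses score strictly above it."""
    ranked = sorted(honest_scores, reverse=True)
    # Number of honest responses we are allowed to flag by mistake.
    allowed = math.floor(max_fpr * len(ranked) + 1e-9)
    allowed = min(allowed, len(ranked) - 1)
    threshold = ranked[allowed]  # flag only scores strictly above this value
    flagged = sum(1 for d in deceptive_scores if d > threshold)
    return flagged / len(deceptive_scores)


# Hypothetical probe scores (higher = more deceptive), for illustration only.
honest = [i / 100 for i in range(100)]
deceptive = [0.5, 0.985, 1.2, 1.5]

area = auroc(deceptive, honest)
recall = recall_at_fpr(deceptive, honest, max_fpr=0.01)
```

The two numbers answer different questions: AUROC summarizes ranking quality across all thresholds, while recall at 1% false positive rate describes a single operating point, which is closer to how a deployed monitor would be used.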

Source