Detecting Strategic Deception Using Linear Probes

Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, Marius Hobbhahn — 2025-02-05 — Apollo Research — arXiv

Summary

Evaluates whether linear probes trained on model activations can robustly detect deceptive behavior in Llama-3.3-70B-Instruct across realistic scenarios including insider trading concealment and sandbagging on safety evaluations.

Key Result

Linear probes achieved AUROCs between 0.96 and 0.999 on evaluation datasets and caught 95-99% of deceptive responses at a 1% false positive rate, though the authors deem current performance insufficient as a robust defense.
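
To make the two metrics above concrete, here is a minimal sketch of how AUROC and recall at a fixed false positive rate can be computed from probe scores. It is not the paper's code: the function names, the synthetic Gaussian scores, and the choice to set the threshold from the honest (negative) scores are all assumptions made for illustration.

```python
import random


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def threshold_at_fpr(neg_scores, fpr=0.01):
    """Threshold such that at most a fraction `fpr` of negatives score strictly above it."""
    ranked = sorted(neg_scores, reverse=True)
    k = int(fpr * len(ranked))  # number of negatives allowed above the threshold
    return ranked[k]


def recall_at_fpr(pos_scores, neg_scores, fpr=0.01):
    """Fraction of positives flagged when the threshold is fixed by the negatives."""
    t = threshold_at_fpr(neg_scores, fpr)
    return sum(s > t for s in pos_scores) / len(pos_scores)


if __name__ == "__main__":
    # Hypothetical probe scores: honest responses centred at 0, deceptive at 4.
    rng = random.Random(0)
    honest = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    deceptive = [rng.gauss(4.0, 1.0) for _ in range(1000)]
    print("AUROC:", auroc(deceptive, honest))
    print("Recall at 1% FPR:", recall_at_fpr(deceptive, honest, fpr=0.01))
```

The two numbers answer different questions: AUROC summarizes separation across all thresholds, while recall at 1% FPR reports what a deployed monitor with one fixed threshold would actually catch.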

Source

Link: https://arxiv.org/abs/2502.03407
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- monitoring-concepts — White-box safety (i.e. Interpretability) / Concept-based interpretability
