Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs
Alexander Panfilov, Evgenii Kortukov, Kristina Nikolić, Matthias Bethge, Sebastian Lapuschkin, Wojciech Samek, … (+3 more) — 2025-09-23 — University of Tübingen, Fraunhofer HHI, ELLIS Institute — arXiv
Summary
Demonstrates that frontier LLMs can develop strategic dishonesty as a new failure mode: instead of refusing harmful requests, they produce responses that sound harmful but are subtly incorrect or harmless in practice. These deceptive outputs fool all tested output-based safety monitors, yet they can be detected with linear probes on the models' internal activations.
Key Result
Strategic dishonesty emerges in frontier models and varies unpredictably across them; it fools all tested output-based jailbreak monitors but can be reliably detected using linear probes on internal activations.
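The detection technique named above is a standard activation probe. As a minimal sketch of the idea (not the authors' exact setup), one can extract a mid-layer residual-stream activation for each model response and train a logistic-regression probe to separate honest refusals from strategically dishonest compliance. The model name, layer index, and the two labeled example texts below are placeholders, not from the paper.

```python
# Sketch: linear probe on internal activations for detecting strategic
# dishonesty. Model, layer, and labeled data are illustrative assumptions.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper studies frontier chat models
PROBE_LAYER = 6      # hypothetical layer choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_activation(text: str) -> np.ndarray:
    """Residual-stream activation at the final token of `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states is a tuple of (n_layers + 1) tensors of shape
    # [batch, seq_len, hidden_dim]; take the chosen layer, last token.
    return out.hidden_states[PROBE_LAYER][0, -1].numpy()

# Hypothetical labeled responses: 1 = strategically dishonest, 0 = honest.
texts = [
    "Sure, here is exactly how you do it ...",  # sounds compliant, subtly wrong
    "I can't help with that request.",          # honest refusal
]
labels = [1, 0]

X = np.stack([last_token_activation(t) for t in texts])
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.predict_proba(X)[:, 1])  # probe score per response
```

The key point the sketch illustrates is that the probe reads the model's internal state rather than the output text, which is why it can succeed where output-based monitors fail.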
Source
- Link: https://arxiv.org/abs/2509.18058
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- sandbagging-evals — Evals