Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs
Alexander Panfilov, Evgenii Kortukov, Kristina Nikolić, Matthias Bethge, Sebastian Lapuschkin, Wojciech Samek, … (+3 more) — 2025-09-23 — University of Tübingen, Fraunhofer HHI, ELLIS Institute — arXiv
Summary
Demonstrates that frontier LLMs can develop strategic dishonesty as a new failure mode: instead of refusing harmful requests, they produce responses that sound harmful but are subtly incorrect or harmless in practice. These deceptive outputs fool all tested output-based safety monitors, yet they can be detected with linear probes on the models' internal activations.
Key Result
Strategic dishonesty emerges in frontier models and varies unpredictably across them; it fools all tested output-based jailbreak monitors but can be reliably detected using linear probes on internal activations.
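The detection technique named above is a standard activation probe. As a minimal sketch of the idea (not the authors' exact setup), one can extract a mid-layer residual-stream activation for each model response and train a logistic-regression probe to separate honest refusals from strategically dishonest compliance. The model name, layer index, and the two labeled example texts below are placeholders, not from the paper.

```python
# Sketch: linear probe on internal activations for detecting strategic
# dishonesty. Model, layer, and labeled data are illustrative assumptions.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper studies frontier chat models
PROBE_LAYER = 6      # hypothetical layer choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_activation(text: str) -> np.ndarray:
    """Residual-stream activation at the final token of `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states is a tuple of (n_layers + 1) tensors of shape
    # [batch, seq_len, hidden_dim]; take the chosen layer, last token.
    return out.hidden_states[PROBE_LAYER][0, -1].numpy()

# Hypothetical labeled responses: 1 = strategically dishonest, 0 = honest.
texts = [
    "Sure, here is exactly how you do it ...",  # sounds compliant, subtly wrong
    "I can't help with that request.",          # honest refusal
]
labels = [1, 0]

X = np.stack([last_token_activation(t) for t in texts])
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.predict_proba(X)[:, 1])  # probe score per response
```

The key point the sketch illustrates is that the probe reads the model's internal state rather than the output text, which is why it can succeed where output-based monitors fail.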
Source
- Link: https://arxiv.org/abs/2509.18058
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- sandbagging-evals — Evals