Compromising Honesty and Harmlessness in Language Models via Deception Attacks
Laurène Vaugrante, Francesca Carlon, Maluna Menke, Thilo Hagendorff — 2025-02-12 — arXiv
Summary
Introduces fine-tuning methods that cause language models to selectively deceive users on targeted topics while remaining accurate on others, demonstrating effectiveness in high-stakes domains and showing that deceptive fine-tuning compromises other safety properties including toxicity resistance.
Key Result
Deception attacks successfully cause models to provide misleading information on targeted topics while remaining accurate elsewhere, and deceptive fine-tuning increases likelihood of producing toxic content including hate speech.
Source
- Link: https://arxiv.org/abs/2502.08301
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- various-redteams — Evals