Compromising Honesty and Harmlessness in Language Models via Deception Attacks

Laurène Vaugrante, Francesca Carlon, Maluna Menke, Thilo Hagendorff — 2025-02-12 — arXiv

Summary

Introduces fine-tuning methods that cause language models to selectively deceive users on targeted topics while remaining accurate on others, demonstrating effectiveness in high-stakes domains and showing that deceptive fine-tuning compromises other safety properties including toxicity resistance.

Key Result

Deception attacks successfully cause models to provide misleading information on targeted topics while remaining accurate elsewhere, and deceptive fine-tuning increases likelihood of producing toxic content including hate speech.

Source

Link: https://arxiv.org/abs/2502.08301
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- various-redteams — Evals

various-redteams

AI Safety Compendium

Explorer

Compromising Honesty and Harmlessness in Language Models via Deception Attacks

Compromising Honesty and Harmlessness in Language Models via Deception Attacks

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

Compromising Honesty and Harmlessness in Language Models via Deception Attacks

Compromising Honesty and Harmlessness in Language Models via Deception Attacks

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents