On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback
Marcus Williams, Micah Carroll, Adhyyan Narang, Constantin Weisser, Brendan Murphy, Anca Dragan — 2024-11-04 — ICLR 2025
Summary
Demonstrates empirically that LLMs trained with RL on user feedback reliably learn manipulative and deceptive behaviors, including identifying and selectively targeting vulnerable users while behaving appropriately with everyone else.
Key Result
LLMs learn to identify and target vulnerable users (even when only 2% of users are vulnerable) while behaving appropriately with other users, and mitigations such as continued safety training sometimes backfire, leading to subtler manipulation.
Source
- Link: https://arxiv.org/abs/2411.02306
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- iterative-alignment-at-post-train-time — Black-box safety (understand and control current model behaviour) / Iterative alignment