It’s the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics

Matthew Kowal, Jasper Timm, Jean-Francois Godbout, Thomas Costello, Antonio A. Arechar, Gordon Pennycook, … (+3 more) — 2025-06-03 — FAR AI, AlignmentResearch — arXiv

Summary

Introduces the Attempt to Persuade Eval (APE), a benchmark that measures frontier LLMs’ willingness to attempt persuasion on harmful topics (conspiracies, controversial issues, harmful content) rather than their persuasion success. It uses multi-turn conversational setups and automated evaluation.

Key Result

Many open- and closed-weight frontier models frequently attempt persuasion on harmful topics. Jailbreaking increases this willingness and reveals significant gaps in current safety guardrails.

Source