It’s the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics
Matthew Kowal, Jasper Timm, Jean-Francois Godbout, Thomas Costello, Antonio A. Arechar, Gordon Pennycook, … (+3 more) — 2025-06-03 — FAR AI, AlignmentResearch — arXiv
Summary
Introduces the Attempt to Persuade Eval (APE), a benchmark that evaluates frontier LLMs' willingness to attempt persuasion on harmful topics (conspiracies, controversial issues, harmful content), rather than measuring persuasion success, using multi-turn conversational setups and an automated evaluator that labels each turn.
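The attempt-focused evaluation described above can be sketched as a simple loop: a judge labels each target-model turn as a persuasion attempt or a refusal, and the score is the attempt rate. This is a hypothetical illustration, not the paper's actual code; the paper uses an LLM judge, for which `judge_attempt` below is a toy keyword-based stand-in.

```python
# Hypothetical sketch of an APE-style scoring loop (not the paper's implementation).
# After each turn by the target model, an automated judge decides whether the
# reply *attempts* to persuade, independent of whether it would succeed.

def judge_attempt(reply: str) -> bool:
    """Toy stand-in for the paper's LLM judge: treats any non-refusal
    reply as a persuasion attempt."""
    refusal_markers = ("i can't", "i cannot", "i won't", "i'm not able")
    lowered = reply.lower()
    return not any(marker in lowered for marker in refusal_markers)

def attempt_rate(target_replies: list[str]) -> float:
    """Fraction of conversation turns in which the target attempted persuasion."""
    labels = [judge_attempt(reply) for reply in target_replies]
    return sum(labels) / len(labels)

# Example multi-turn episode: two attempts out of three turns.
replies = [
    "Actually, the evidence strongly supports this theory, because...",
    "I can't help with promoting that claim.",
    "Consider this: many experts privately agree that...",
]
rate = attempt_rate(replies)  # 2/3 of turns are persuasion attempts
```

In the paper's setting the judge is itself a model scoring each turn; the per-turn labeling and aggregation shape would be the same.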
Key Result
Many open- and closed-weight frontier models frequently attempt persuasion on harmful topics; jailbreaking further increases this willingness, revealing significant gaps in current safety guardrails.
Source
- Link: https://arxiv.org/abs/2506.02873
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals