It’s the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics
Matthew Kowal, Jasper Timm, Jean-Francois Godbout, Thomas Costello, Antonio A. Arechar, Gordon Pennycook, … (+3 more) — 2025-06-03 — FAR AI, AlignmentResearch — arXiv
Summary
Introduces the Attempt to Persuade Eval (APE), a benchmark that evaluates frontier LLMs' willingness to attempt persuasion on harmful topics (conspiracies, controversial issues, harmful content), rather than measuring persuasion success, using multi-turn conversational setups and an automated evaluator that labels each turn.
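The attempt-focused evaluation described above can be sketched as a simple loop: a judge labels each target-model turn as a persuasion attempt or a refusal, and the score is the attempt rate. This is a hypothetical illustration, not the paper's actual code; the paper uses an LLM judge, for which `judge_attempt` below is a toy keyword-based stand-in.

```python
# Hypothetical sketch of an APE-style scoring loop (not the paper's implementation).
# After each turn by the target model, an automated judge decides whether the
# reply *attempts* to persuade, independent of whether it would succeed.

def judge_attempt(reply: str) -> bool:
    """Toy stand-in for the paper's LLM judge: treats any non-refusal
    reply as a persuasion attempt."""
    refusal_markers = ("i can't", "i cannot", "i won't", "i'm not able")
    lowered = reply.lower()
    return not any(marker in lowered for marker in refusal_markers)

def attempt_rate(target_replies: list[str]) -> float:
    """Fraction of conversation turns in which the target attempted persuasion."""
    labels = [judge_attempt(reply) for reply in target_replies]
    return sum(labels) / len(labels)

# Example multi-turn episode: two attempts out of three turns.
replies = [
    "Actually, the evidence strongly supports this theory, because...",
    "I can't help with promoting that claim.",
    "Consider this: many experts privately agree that...",
]
rate = attempt_rate(replies)  # 2/3 of turns are persuasion attempts
```

In the paper's setting the judge is itself a model scoring each turn; the per-turn labeling and aggregation shape would be the same.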
Key Result
Many open- and closed-weight frontier models frequently attempt persuasion on harmful topics; jailbreaking further increases this willingness, revealing significant gaps in current safety guardrails.
Source
- Link: https://arxiv.org/abs/2506.02873
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals