Eliciting Language Model Behaviors with Investigator Agents
Xiang Lisa Li, Neil Chowdhury, Daniel D. Johnson, Tatsunori Hashimoto, Percy Liang, Sarah Schwettmann, … (+1 more) — 2025-02-03 — Stanford University, UC Berkeley — arXiv
Summary
Develops investigator models that automatically generate diverse prompts to elicit specific target behaviors (jailbreaks, hallucinations, harmful responses) from language models. The investigators are trained with supervised fine-tuning, DPO, and a novel Frank-Wolfe training objective.
Key Result
Achieves a 100% attack success rate on a subset of AdvBench (Harmful Behaviors) and an 85% hallucination rate, using automatically generated, human-interpretable prompts.
Source
- Link: https://arxiv.org/abs/2502.01236
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals