Eliciting Language Model Behaviors with Investigator Agents

Xiang Lisa Li, Neil Chowdhury, Daniel D. Johnson, Tatsunori Hashimoto, Percy Liang, Sarah Schwettmann, … (+1 more) — 2025-02-03 — Stanford University, UC Berkeley — arXiv

Summary

Develops investigator models that automatically generate diverse prompts to elicit specific target behaviors (jailbreaks, hallucinations, harmful responses) from language models. The investigators are trained with supervised fine-tuning, direct preference optimization (DPO), and a novel Frank-Wolfe training objective.

Key Result

Achieves a 100% attack success rate on a subset of AdvBench (Harmful Behaviors) and an 85% hallucination rate using automatically generated, human-interpretable prompts.

Source