Eliciting Language Model Behaviors with Investigator Agents

Xiang Lisa Li, Neil Chowdhury, Daniel D. Johnson, Tatsunori Hashimoto, Percy Liang, Sarah Schwettmann, … (+1 more) — 2025-02-03 — Stanford University, UC Berkeley — arXiv

Summary

Develops investigator models that automatically generate diverse prompts to elicit specific target behaviors (jailbreaks, hallucinations, harmful responses) from language models. The investigators are trained with supervised fine-tuning, DPO, and a novel Frank-Wolfe training objective.

Key Result

Achieves a 100% attack success rate on the AdvBench Harmful Behaviors subset and an 85% hallucination rate using automatically generated, human-interpretable prompts.

Source

Link: https://arxiv.org/abs/2502.01236
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals
