Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning
Alex Beutel, Kai Xiao, Johannes Heidecke, Lilian Weng — 2024-12-24 — OpenAI — arXiv
Summary
Develops automated red-teaming methods that generate diverse and effective attacks by using LLMs to create goal-specific prompts and rule-based rewards, combined with multi-step RL training that rewards diversity while maintaining attack effectiveness.
Key Result
The approach generates highly-effective and considerably more diverse attacks than prior general red-teaming approaches for both prompt injection and unsafe response elicitation.
Source
- Link: https://arxiv.org/abs/2412.18693
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- various-redteams — Evals