Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning

Alex Beutel, Kai Xiao, Johannes Heidecke, Lilian Weng — 2024-12-24 — OpenAI — arXiv

Summary

Develops automated red-teaming methods that generate attacks that are both diverse and effective. LLMs are used to create goal-specific prompts and matching rule-based rewards, and a multi-step RL training procedure rewards diversity while maintaining attack effectiveness.
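
To make the idea of "rewarding diversity while maintaining effectiveness" concrete, here is a minimal sketch of one way such a reward could be combined. It is not the paper's formulation: the function names, the substring rule, the bag-of-words cosine similarity, and the multiplicative combination are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def rule_based_reward(response: str, required_phrase: str) -> float:
    """Hypothetical rule: the attack succeeded if the response contains the phrase."""
    return 1.0 if required_phrase.lower() in response.lower() else 0.0


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def diversity_bonus(attack: str, previous_attacks: list[str]) -> float:
    """1 minus the highest similarity to any earlier attack (assumed metric)."""
    if not previous_attacks:
        return 1.0
    vec = Counter(attack.lower().split())
    max_sim = max(_cosine(vec, Counter(p.lower().split())) for p in previous_attacks)
    # Clamp to guard against floating-point rounding slightly above 1.0.
    return max(0.0, 1.0 - max_sim)


def combined_reward(
    attack: str, response: str, required_phrase: str, previous_attacks: list[str]
) -> float:
    """Diversity only counts when the attack is effective (assumed gating)."""
    effectiveness = rule_based_reward(response, required_phrase)
    return effectiveness * diversity_bonus(attack, previous_attacks)
```

With this gating, an attack that repeats an earlier one verbatim earns no reward even if it succeeds, and a novel attack that fails also earns none, so only attacks that are both new and effective are reinforced.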

Key Result

The approach generates highly effective and considerably more diverse attacks than prior general red-teaming approaches, for both prompt injection and unsafe response elicitation.

Source