Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning

Alex Beutel, Kai Xiao, Johannes Heidecke, Lilian Weng — 2024-12-24 — OpenAI — arXiv

Summary

Develops automated red-teaming methods that generate attacks that are both diverse and effective. LLMs are used to create goal-specific prompts and rule-based rewards, and the attacker is then trained with multi-step RL that rewards diversity while maintaining attack effectiveness.
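To make the idea of rewarding diversity while maintaining effectiveness concrete, here is a minimal sketch of one way such a combined reward could be computed. It is not the paper's implementation: the function names, the token-set Jaccard similarity used as a diversity measure, the simple callable "rules" standing in for rule-based rewards, and the multiplicative combination are all assumptions made for illustration.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two strings, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def diversity_score(attack: str, prior_attacks: List[str]) -> float:
    """1 minus the similarity to the closest earlier attack (assumed measure)."""
    if not prior_attacks:
        return 1.0  # the first attack in an episode is maximally novel
    return 1.0 - max(jaccard(attack, prior) for prior in prior_attacks)


def effectiveness_score(response: str, rules: List[Callable[[str], bool]]) -> float:
    """Fraction of hypothetical rule checks that the target's response satisfies."""
    if not rules:
        return 0.0
    return sum(1 for rule in rules if rule(response)) / len(rules)


def combined_reward(
    attack: str,
    response: str,
    prior_attacks: List[str],
    rules: List[Callable[[str], bool]],
) -> float:
    """Multiply the two scores so an attack must be both novel and successful."""
    return effectiveness_score(response, rules) * diversity_score(attack, prior_attacks)
```

In a multi-step setting, `prior_attacks` would hold the attacker's earlier attempts in the same episode, so repeating a successful attack earns little reward. Multiplying the scores is one design choice among several; a weighted sum would instead let a very effective attack offset low novelty.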

Key Result

The approach generates highly effective and considerably more diverse attacks than prior general red-teaming approaches, for both prompt injection and unsafe response elicitation.

Source

Link: https://arxiv.org/abs/2412.18693
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals
