AgentBreeder: Mitigating the AI Safety Risks of Multi-Agent Scaffolds via Self-Improvement

J Rosser, Jakob Foerster — 2025-02-02 — arXiv

Summary

Introduces AgentBreeder, a framework for multi-objective evolutionary search over multi-agent LLM scaffolds. In blue mode, the search optimizes scaffolds for safety, giving a 79.4% average uplift on safety benchmarks. In red mode, it reveals adversarially weak scaffolds that emerge during capability optimization.
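To make "multi-objective" concrete, the sketch below shows one generic way a search could keep scaffolds that trade off capability against safety: Pareto-dominance filtering. This is an illustration under stated assumptions, not the paper's selection procedure, which this entry does not describe. The `Scaffold` type, the scaffold names, and all scores are hypothetical.

```python
from typing import List, NamedTuple


class Scaffold(NamedTuple):
    """Hypothetical record for one candidate scaffold (not from the paper)."""
    name: str
    capability: float  # illustrative score, higher is better
    safety: float      # illustrative score, higher is better


def dominates(a: Scaffold, b: Scaffold) -> bool:
    """True if a is at least as good as b on both objectives and strictly better on one."""
    at_least_as_good = a.capability >= b.capability and a.safety >= b.safety
    strictly_better = a.capability > b.capability or a.safety > b.safety
    return at_least_as_good and strictly_better


def pareto_front(population: List[Scaffold]) -> List[Scaffold]:
    """Keep every scaffold that no other scaffold in the population dominates."""
    return [s for s in population if not any(dominates(o, s) for o in population)]


# Made-up population purely for illustration.
population = [
    Scaffold("single_agent", 0.60, 0.50),
    Scaffold("debate", 0.70, 0.40),
    Scaffold("debate_with_reviewer", 0.68, 0.75),
    Scaffold("weak_chain", 0.55, 0.30),
]

front = pareto_front(population)
for scaffold in front:
    print(scaffold.name, scaffold.capability, scaffold.safety)
```

In an evolutionary loop, the surviving front would typically seed the next generation of candidates. Here `debate` and `debate_with_reviewer` survive because neither beats the other on both objectives.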

Key Result

Blue mode optimization achieved 79.4% average uplift in safety benchmark performance while maintaining capabilities; red mode discovered adversarially weak scaffolds emerging alongside capability improvements.

Source

Link: https://arxiv.org/abs/2502.00757
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- tools-for-aligning-multiple-ais — Multi-agent first
