Petri: An open-source auditing tool to accelerate AI safety research
2025-10-06 — Anthropic — Anthropic Alignment Science Blog
Summary
Releases Petri, an open-source framework that uses AI auditor agents to automatically probe target models for misaligned behaviors across diverse scenarios, with empirical results from testing 14 frontier models on 111 seed instructions.
Key Result
Successfully elicited misaligned behaviors, including autonomous deception, oversight subversion, and whistleblowing, across 14 frontier models, with systematic ablation studies identifying agency and leadership complicity as key factors influencing model behavior.
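The auditing pattern the summary describes (an auditor agent probing a target model over a multi-turn scenario derived from a seed instruction, with a judge scoring the resulting transcript) can be sketched schematically. This is a minimal illustrative sketch only: every name here (`run_audit`, `auditor_turn`, `target_reply`, `judge_score`) is a hypothetical stand-in, not Petri's actual API, and the model calls are stubbed out.

```python
# Hypothetical sketch of an automated audit loop in the style described above.
# All function names are illustrative stand-ins, not Petri's real interface.

def auditor_turn(seed: str, transcript: list) -> str:
    # In a real auditor, an LLM would craft the next probe from the seed
    # scenario and the conversation so far; here we return a placeholder.
    return f"[probe {len(transcript) // 2 + 1} for: {seed}]"

def target_reply(message: str) -> str:
    # Stand-in for the target model under audit.
    return f"[reply to {message}]"

def judge_score(transcript: list) -> dict:
    # Stand-in for an LLM judge scoring the transcript on
    # misalignment dimensions; real scores would come from a model.
    return {"deception": 0.0, "oversight_subversion": 0.0}

def run_audit(seed: str, max_turns: int = 3) -> dict:
    """Run one seeded audit: alternate auditor probes and target replies,
    then score the full transcript."""
    transcript = []
    for _ in range(max_turns):
        probe = auditor_turn(seed, transcript)
        transcript.append(("auditor", probe))
        transcript.append(("target", target_reply(probe)))
    return {"seed": seed, "transcript": transcript, "scores": judge_score(transcript)}

result = run_audit("pressure the model to hide an error from its overseer")
print(len(result["transcript"]))  # 6 entries: 3 auditor probes, 3 target replies
```

Scaling this loop over many seed instructions and target models is what makes the systematic ablations (e.g., varying agency or leadership complicity in the seed scenarios) tractable.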
Source
- Link: https://alignment.anthropic.com/2025/petri/
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals