Petri: An open-source auditing tool to accelerate AI safety research

2025-10-06 — Anthropic — Anthropic Alignment Science Blog

Summary

Releases Petri, an open-source framework using AI agents to automate alignment evaluations by systematically testing target models across diverse scenarios, and demonstrates it successfully elicited misaligned behaviors from 14 frontier models across 111 seed instructions.

Key Result

Applied to 14 frontier models, Petri successfully elicited autonomous deception, oversight subversion, whistleblowing, and cooperation with human misuse, with Claude Sonnet 4.5 and GPT-5 showing the strongest safety profiles while Gemini 2.5 Pro, Grok-4, and Kimi K2 showed concerning rates of user deception.

Source

Link: https://alignment.anthropic.com/2025/petri
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- anthropic — Labs (giant companies)

anthropic

AI Safety Compendium

Explorer

Petri: An open-source auditing tool to accelerate AI safety research

Petri: An open-source auditing tool to accelerate AI safety research

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

Petri: An open-source auditing tool to accelerate AI safety research

Petri: An open-source auditing tool to accelerate AI safety research

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents