Petri: An open-source auditing tool to accelerate AI safety research

2025-10-07 — Anthropic — LessWrong

Summary

Release of Petri, an open-source framework using AI agents to automatically audit target models for misaligned behaviors across diverse scenarios, automating evaluation workflows from environment simulation to transcript analysis.

Key Result

Successfully elicited autonomous deception, oversight subversion, whistleblowing, and cooperation with human misuse across 14 frontier models using 111 seed instructions.

Source

Link: https://lesswrong.com/posts/kffbZGa2yYhc6cakc/petri-an-open-source-auditing-tool-to-accelerate-ai-safety
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- various-redteams — Evals

various-redteams

AI Safety Compendium

Explorer

Petri: An open-source auditing tool to accelerate AI safety research

Petri: An open-source auditing tool to accelerate AI safety research

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

Petri: An open-source auditing tool to accelerate AI safety research

Petri: An open-source auditing tool to accelerate AI safety research

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents