Petri: An open-source auditing tool to accelerate AI safety research
2025-10-07 — Anthropic — LessWrong
Summary
Release of Petri, an open-source framework using AI agents to automatically audit target models for misaligned behaviors across diverse scenarios, automating evaluation workflows from environment simulation to transcript analysis.
Key Result
Successfully elicited autonomous deception, oversight subversion, whistleblowing, and cooperation with human misuse across 14 frontier models using 111 seed instructions.
Source
- Link: https://lesswrong.com/posts/kffbZGa2yYhc6cakc/petri-an-open-source-auditing-tool-to-accelerate-ai-safety
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- various-redteams — Evals