Stress Testing Deliberative Alignment for Anti-Scheming Training
Bronson Schoen, Evgenia Nitishinskaya, Mikita Balesni, Axel Højmark, Felix Hofstätter, Jérémy Scheurer, … (+13 more) — 2025-09-19 — OpenAI, Anthropic
Source
- Link: https://www.arxiv.org/pdf/2509.15541
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- ai-scheming-evals — Evals