Edge Cases in AI Alignment
Florian Dietz — 2025-03-24 — MATS Program — LessWrong
Summary
Creates an automated benchmark of ~800 prompts across 32 topics testing unusual but impactful AI alignment scenarios, with heavy LLM automation for prompt generation and evaluation.
Key Result
A simple corrigibility intervention in system prompts reduced alignment faking propensities across all tested frontier models (GPT-4o, Claude 3.5/3.7, DeepSeek R1), and models show striking disparities between stated and actual alignment faking behavior.
Source
- Link: https://lesswrong.com/posts/bqWihHtDnDseyfF2T/edge-cases-in-ai-alignment-2
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- ai-deception-evals — Evals