Edge Cases in AI Alignment

Florian Dietz — 2025-03-24 — MATS Program — LessWrong

Summary

Creates an automated benchmark of ~800 prompts across 32 topics testing unusual but impactful AI alignment scenarios, with heavy LLM automation for prompt generation and evaluation.

Key Result

A simple corrigibility intervention in system prompts reduced alignment faking propensities across all tested frontier models (GPT-4o, Claude 3.5/3.7, DeepSeek R1), and models show striking disparities between stated and actual alignment faking behavior.

Source