Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models

Alex Turner — 2025-03-01 — turntrout.com

Summary

Proposes that training data describing AI systems as misaligned could create self-fulfilling misalignment via out-of-context reasoning. Synthesizes supporting evidence from multiple studies and suggests testing methodologies along with technical mitigations, including data filtering, conditional pretraining, and gradient routing.

Source