Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models
Alex Turner — 2025-03-01 — turntrout.com
Summary
Proposes that training data describing AI systems as misaligned could create self-fulfilling misalignment via out-of-context reasoning. Synthesizes evidence from multiple studies and suggests testing methodologies, along with technical mitigations including data filtering, conditional pretraining, and gradient routing.
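The mitigations named above are data-level interventions on the pretraining corpus. As a loose sketch of the first two (not the post's implementation; every function name, pattern, and control token below is hypothetical), one pass might flag documents that portray AI as misaligned, then either drop them or prefix them with a control tag:

```python
# Loose sketch only (all names, patterns, and tokens hypothetical): flag
# pretraining documents that portray AI systems as misaligned, then either
# drop them (data filtering) or prepend a control tag (conditional
# pretraining).
import re
from typing import Iterable, Iterator

# Toy patterns; a real pipeline would presumably use a trained classifier.
_MISALIGNMENT_PATTERNS = [
    re.compile(r"\b(rogue|treacherous|misaligned)\s+(AI|model|agent)s?\b", re.I),
    re.compile(r"\bAI\b[^.]*\b(deceive|betray|rebel)\w*\b", re.I),
]

def describes_misaligned_ai(doc: str) -> bool:
    """True if the document matches any misalignment pattern."""
    return any(p.search(doc) for p in _MISALIGNMENT_PATTERNS)

def filter_corpus(docs: Iterable[str]) -> Iterator[str]:
    """Data filtering: yield only documents that pass the flag check."""
    return (d for d in docs if not describes_misaligned_ai(d))

def tag_corpus(docs: Iterable[str]) -> Iterator[str]:
    """Conditional pretraining: keep every document, but prefix flagged
    ones with a control token so the associated behaviour can be
    conditioned away at inference time (tokens are placeholders)."""
    for d in docs:
        tag = ("<|fiction_misaligned_ai|>" if describes_misaligned_ai(d)
               else "<|neutral|>")
        yield f"{tag} {d}"
```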
Source
- Link: https://turntrout.com/self-fulfilling-misalignment
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- hyperstition-studies — Black-box safety (understand and control current model behaviour) / Better data