Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models
Alex Turner — 2025-03-01 — turntrout.com
Summary
Proposes that training data describing AI systems as misaligned could create self-fulfilling misalignment via out-of-context reasoning. Synthesizes evidence from multiple studies and suggests testing methodologies, along with technical mitigations including data filtering, conditional pretraining, and gradient routing.
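The mitigations named above are data-level interventions on the pretraining corpus. As a loose sketch of the first two (not the post's implementation; every function name, pattern, and control token below is hypothetical), one pass might flag documents that portray AI as misaligned, then either drop them or prefix them with a control tag:

```python
# Loose sketch only (all names, patterns, and tokens hypothetical): flag
# pretraining documents that portray AI systems as misaligned, then either
# drop them (data filtering) or prepend a control tag (conditional
# pretraining).
import re
from typing import Iterable, Iterator

# Toy patterns; a real pipeline would presumably use a trained classifier.
_MISALIGNMENT_PATTERNS = [
    re.compile(r"\b(rogue|treacherous|misaligned)\s+(AI|model|agent)s?\b", re.I),
    re.compile(r"\bAI\b[^.]*\b(deceive|betray|rebel)\w*\b", re.I),
]

def describes_misaligned_ai(doc: str) -> bool:
    """True if the document matches any misalignment pattern."""
    return any(p.search(doc) for p in _MISALIGNMENT_PATTERNS)

def filter_corpus(docs: Iterable[str]) -> Iterator[str]:
    """Data filtering: yield only documents that pass the flag check."""
    return (d for d in docs if not describes_misaligned_ai(d))

def tag_corpus(docs: Iterable[str]) -> Iterator[str]:
    """Conditional pretraining: keep every document, but prefix flagged
    ones with a control token so the associated behaviour can be
    conditioned away at inference time (tokens are placeholders)."""
    for d in docs:
        tag = ("<|fiction_misaligned_ai|>" if describes_misaligned_ai(d)
               else "<|neutral|>")
        yield f"{tag} {d}"
```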
Source
- Link: https://turntrout.com/self-fulfilling-misalignment
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- hyperstition-studies — Black-box safety (understand and control current model behaviour) / Better data