Distribution Shift

Distribution shift occurs when the inputs an AI system encounters during deployment differ systematically from the inputs it was trained on. Since machine learning models learn statistical patterns from training data, they may perform well in-distribution but fail in unexpected ways when those patterns no longer hold.
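As a minimal illustration (synthetic two-dimensional Gaussian data fit with scikit-learn; all parameters here are invented), the sketch below trains a linear classifier on one distribution and evaluates it both on held-out data from that distribution and on a translated copy of the same task:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, center):
    """Two Gaussian classes centered at `center` and `center + 2`."""
    X = np.vstack([rng.normal(center, 1.0, (n, 2)),
                   rng.normal(center + 2.0, 1.0, (n, 2))])
    y = np.array([0] * n + [1] * n)
    return X, y

# Fit on the training distribution (classes centered at 0 and 2).
X_train, y_train = sample(500, center=0.0)
model = LogisticRegression().fit(X_train, y_train)

# In-distribution test data vs. a shifted copy of the same task:
# both classes translated by +3, so the learned boundary no longer fits.
X_iid, y_iid = sample(500, center=0.0)
X_ood, y_ood = sample(500, center=3.0)
print("in-distribution accuracy:", model.score(X_iid, y_iid))  # roughly 0.9
print("shifted accuracy:        ", model.score(X_ood, y_ood))  # near chance
```

The statistical pattern the model learned (where the class boundary sits) is still locally coherent, but it no longer describes the shifted inputs, so accuracy collapses.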

Why It Matters for AI Safety

Distribution shift is a central challenge for deploying AI systems safely. A model trained on one set of conditions may have learned shortcuts that work within the training distribution but fail outside it, and the model has no mechanism to detect or flag this. The result is confident failures: the model behaves as if nothing is wrong while producing incorrect or harmful outputs.
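A minimal sketch of such a confident failure, using the same kind of synthetic setup (illustrative only): the classifier reports near-certain probabilities on inputs far outside anything it was trained on, with no signal that something is wrong.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0.0, 1.0, (500, 2)),
                     rng.normal(2.0, 1.0, (500, 2))])
y_train = np.array([0] * 500 + [1] * 500)
model = LogisticRegression().fit(X_train, y_train)

# Inputs far outside anything the model saw during training.
X_far = rng.normal(50.0, 1.0, (5, 2))

# The predicted probabilities are near-certain anyway: the model has
# no built-in notion of "I have never seen anything like this".
print(model.predict_proba(X_far).max(axis=1))  # approx. [1. 1. 1. 1. 1.]
```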

This problem is especially acute for safety-critical applications. A model evaluated as safe in testing may face genuinely novel situations in deployment — including adversarial inputs deliberately designed to exploit distributional gaps.

Connection to Deceptive Alignment

Distribution shift is the mechanism underlying one version of deceptive alignment: a model that appears aligned during training (which is in-distribution for training-time evaluation) might pursue different objectives in deployment (which is out-of-distribution). The model’s training-time behavior is not a reliable guide to its deployment-time behavior if the distributions differ.

Connection to Robustness

Robustness is the property that counteracts distribution shift: a robust model performs well across a range of distributions, not just its training distribution. Building robust models requires actively testing on out-of-distribution inputs during development, which is part of what capability evaluation programs aim to do. A sketch of this kind of testing follows.
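One way to make such testing concrete is to sweep a trained model across a family of progressively shifted test sets and flag where performance degrades. The sketch below does this with synthetic data; the shift values and the 0.9 threshold are arbitrary stand-ins for a real evaluation suite.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

def make_data(n, shift):
    """Two Gaussian classes, both translated by `shift` to simulate drift."""
    X = np.vstack([rng.normal(shift, 1.0, (n, 2)),
                   rng.normal(shift + 2.0, 1.0, (n, 2))])
    y = np.array([0] * n + [1] * n)
    return X, y

X_train, y_train = make_data(500, shift=0.0)
model = LogisticRegression().fit(X_train, y_train)

# Sweep the same task across increasingly shifted test distributions
# and flag where accuracy drops below an (arbitrary) 0.9 threshold.
for shift in [0.0, 0.5, 1.0, 2.0, 4.0]:
    X_test, y_test = make_data(500, shift)
    acc = model.score(X_test, y_test)
    flag = "" if acc >= 0.9 else "  <- degraded"
    print(f"shift={shift:.1f}  accuracy={acc:.2f}{flag}")
```

In a real evaluation program the shifted test sets would come from held-out domains, later time periods, or adversarially perturbed inputs rather than a synthetic translation parameter, but the loop structure is the same: one trained model, many distributions, explicit reporting of where it breaks.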
