Robustness
Robustness in AI refers to a system’s ability to perform reliably across a wide range of inputs, including inputs that differ from its training distribution, inputs that have been deliberately modified to cause failures (adversarial examples), or novel situations the system has not encountered before.
Why Robustness Matters for Safety
A model that behaves well during training and evaluation but fails in deployment is dangerous precisely because the failure is hidden. catherine-olsson frames robustness as the second half of the safety problem: once you have given a system the right objective, you must ensure it optimizes for that objective consistently — not just on familiar inputs.
This connects directly to deceptive-alignment: a model might appear aligned during training but pursue different goals when encountering out-of-distribution inputs in deployment.
Types of Robustness Failures
- Distribution shift: The input distribution at deployment differs from the one seen in training (see distribution-shift; a minimal sketch follows this list).
- Adversarial examples: Inputs crafted specifically to cause failures, often imperceptible to humans.
- Specification gaming: Finding ways to satisfy the stated objective while violating its intent — a failure of robustness to the gap between specification and intent.
- Out-of-context failure: Behaviors that work in isolation but break down in combination with other system components.
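To make the first failure mode concrete, here is a minimal sketch of distribution shift, assuming scikit-learn and NumPy; the two-feature Gaussian data and the size of the shift are invented for illustration. A classifier fit on the training distribution loses accuracy once the deployment inputs drift.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(mean, n=1000):
    # Two Gaussian classes, separated by 2.0 along both features.
    x0 = rng.normal(mean, 1.0, size=(n, 2))
    x1 = rng.normal(mean + 2.0, 1.0, size=(n, 2))
    return np.vstack([x0, x1]), np.array([0] * n + [1] * n)

# Fit on one input distribution, then evaluate on a shifted one.
X_train, y_train = sample(mean=0.0)
X_shifted, y_shifted = sample(mean=1.5)  # deployment inputs have drifted

clf = LogisticRegression().fit(X_train, y_train)
print("train-distribution accuracy:", clf.score(X_train, y_train))
print("shifted-distribution accuracy:", clf.score(X_shifted, y_shifted))
```

The decision boundary learned on the original data sits in the wrong place for the shifted data, so accuracy drops even though nothing about the model itself changed.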
Adversarial Robustness
A specific subfield documented at length in ai-safety:
- Adversarial examples: Szegedy et al. (2013) showed that imperceptible perturbations could cause image classifiers to misclassify with high confidence. The phenomenon persists in modern neural networks, though contemporary perturbations are typically large enough to be perceptible (a minimal attack sketch follows this list).
- Reward-model gaming: Language models trained against a reward model for long enough will exploit vulnerabilities in it, achieving higher reward-model scores while performing worse on the intended task. Adversarially robust reward models partially mitigate this.
- AI evaluating AI: Any system used to evaluate or monitor another AI system must itself be adversarially robust — including monitoring tools, which an agent could tamper with to obtain higher reward.
- LLM-specific attacks: Large language models are vulnerable to prompt injection (instructions embedded in inputs to bypass safety measures), model stealing, and misuse as misinformation generators.
- Audio and security domains: Speech-to-text systems can be made to transcribe attacker-chosen messages from imperceptibly modified audio. Network intrusion and malware detection systems must be adversarially robust because attackers will design their attacks specifically to fool the detectors.
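To illustrate the adversarial-examples item above, here is a minimal sketch of the fast gradient sign method (Goodfellow et al. 2014), one standard way such perturbations are constructed. It assumes a differentiable PyTorch classifier `model` and image batches scaled to [0, 1]; the function name and epsilon value are illustrative, not taken from the cited sources.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, epsilon=0.03):
    """Nudge each input element by epsilon in the direction that most
    increases the classification loss (fast gradient sign method)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Per-pixel the change is tiny, but because it is aligned with the loss
    # gradient it is often enough to flip the predicted class.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```

Measuring accuracy on inputs perturbed this way, and training on them (adversarial training), is a common way adversarial robustness is evaluated and improved.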
Fault Tolerance and Redundancy
Beyond the robustness of individual models, safety-critical AI work proposes architectural redundancy: multiple independently developed or trained models processing the same input, with consensus or voting mechanisms aggregating their outputs. This reduces the risk that a single faulty, compromised, or deceptive model causes harm.
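A minimal sketch of the voting idea, assuming each model exposes a hypothetical `predict` method that returns a class label; the names and quorum rule are illustrative rather than taken from any particular system.

```python
from collections import Counter

def vote(models, x, quorum=None):
    """Aggregate predictions from independently developed or trained models.
    Return the majority label only when at least `quorum` models agree;
    otherwise return None so the caller can fail safe or escalate to a human."""
    votes = [m.predict(x) for m in models]
    quorum = quorum if quorum is not None else len(models) // 2 + 1
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= quorum else None
```

The benefit depends on the models failing independently: if they share training data, architecture, or a compromised dependency, their errors correlate and the redundancy buys much less.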
Relationship to Other Safety Areas
Robustness failures are one of the primary motivations for:
- interpretability — understanding why models make the decisions they do, in order to anticipate where failures will occur
- scalable-oversight — catching failures that emerge in deployment at scale
- capability-evaluations — testing robustness before deployment
- ai-control — operating safely despite potential failures
Related Pages
- distribution-shift
- deceptive-alignment
- interpretability
- scalable-oversight
- capability-evaluations
- ai-control
- reward-learning
- rlhf
- catherine-olsson
- 80k-podcast-olsson-ziegler-ml-engineering
- ai-safety
- concrete-problems-in-ai-safety
- daniel-ziegler
- deepmind
- openai
- elsa
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: 80,000 Hours Podcast — Catherine Olsson & Daniel Ziegler on ML Engineering and Safety — referenced as [[80k-podcast-olsson-ziegler-ml-engineering]]
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]