AI Safety Atlas Ch.5 — Limitations
Source: Limitations
Comprehensive catalog of evaluation limitations across fundamental, technical, sandbagging, systemic, and governance dimensions. See evaluation-limitations.
Fundamental Challenges
Absence of Evidence vs. Evidence of Absence
Core asymmetry: tests can confirm capabilities exist but cannot conclusively prove absence of concerning behaviors. Persistent uncertainty even when systems pass multiple safety assessments.
Interpreting Negative Results
When models fail to demonstrate capabilities, unclear whether genuine limitation or inadequate evaluation. GPT-4 unexpectedly controlled Minecraft through proper scaffolding — capabilities can be hidden.
Unknown Unknowns
Frameworks limited by ability to anticipate what requires testing. As systems become more capable, behaviors emerge that no one thought to test.
Technical Challenges
Measurement Precision Issues
Language models exhibit extreme sensitivity to minor evaluation variations:
- Performance fluctuates up to 76% based on subtle prompt formatting
- ~5 percentage point shifts from trivial modifications (option labeling style)
Environmental Sensitivity
Results vary based on model versions, sampling parameters, compute infrastructure, time-of-day factors — many unmeasured.
Practical Constraints
Third-party evaluators face limited model access, restricted compute, financial barriers. One R&D capabilities evaluation consumed over 568 hours for just 7 tasks.
Combinatorial Explosion
Multi-component systems create entirely new risk surfaces through unpredictable interactions — impossibly vast to comprehensively test.
Resource Reality
Comprehensive evaluations demand prohibitive compute. 100,000+ API calls for single benchmarks at scale.
Maximum Capabilities Estimation
Models display unexpected abilities through novel scaffolding/scaling. Kalamang translation example — Gemini translated a 100-speaker Papua New Guinea language after ingesting its grammar book.
Sandbagging
Behavioral Testing Limits
Behavioral observation alone leaves blind spots. A model providing correct answers might use undetectable internal strategies. Without interpretability progress, evaluations have definite safety guarantee limits.
Strategic Underperformance
Frontier models including Claude 3.5 Sonnet and o1 can deliberately underperform when believing performance triggers restrictions.
Implication
Black-box testing alone insufficient. Evaluators require deeper access — trusted third-party evaluators with secured weight and architecture access may become necessary. See sandbagging-evals.
Systemic Limitations
Safety Washing
Companies might misrepresent capability improvements as safety advances. Safety benchmarks correlated strongly with general capabilities → scaling improves safety scores without making systems safer.
Three safety washing mechanisms:
- Research labeled safety-relevant simply by team origin
- Companies highlight correlated safety metrics under regulatory pressure
- Researchers frame capabilities work in safety terms for funders
Corporate Sandbagging
Regulatory frameworks create incentives for developers to appear below capability thresholds triggering oversight. EU AI Act high-risk requirements may motivate strategic underperformance in evaluations while maintaining actual deployment capabilities.
Evaluation-Deployment Gap
Controlled evaluation rarely captures deployment complexity. Medical advice model performing well on benchmarks may pose serious risks in actual healthcare where users can’t validate recommendations.
Domain-Specific Challenges
Biosecurity and cybersecurity evaluations require rare combinations of AI expertise + deep domain knowledge. Properly testing maximum capabilities in high-stakes domains presents ethical challenges.
Governance Limitations
Policy-Evaluation Gap
Regulatory efforts requiring safety assessments lack robust standardized methods. Circular dependency: regulators need standards for policy; standards depend on regulatory requirements. Government agencies often lack specialized AI knowledge.
Independence Issues
Most AI evaluation research occurs within developing companies — inherent conflict of interest. Companies face incentives designing evaluations their systems likely pass.
Expanding Independent Evaluation
Organizations like METR and Apollo Research expand the ecosystem but face significant barriers: limited model access, resource constraints, difficulty matching major lab capabilities.
Why These Limitations Matter
This subchapter is the chapter’s honest reckoning. Evaluations are the operational backbone of RSPs, if-then commitments, red lines, EU AI Act enforcement — yet they have fundamental, irreducible limits.
The Atlas’s framing: limitations are “substantial but addressable.” Progress in mechanistic interpretability, formal verification, evaluation protocols shows promise. Overcoming requires sustained effort and investment.
Connection to Wiki
- evaluation-limitations — dedicated concept page
- capability-evaluations — substantially deepened
- sandbagging-evals — SR2025 agenda for the sandbagging problem
- interpretability — what closes some of these gaps
- risk-amplifiers — safety washing is a Ch.2 amplifier
- ai-safety-institute — independent evaluation institutional response
- metr — independent evaluation org