Evaluation Limitations
Comprehensive catalog of limitations in AI safety evaluation — fundamental, technical, sandbagging-related, systemic, and governance-level. The AI Safety Atlas (Ch.5.9) treats these as substantial but addressable — progress requires sustained effort.
Fundamental Challenges
Absence of Evidence vs. Evidence of Absence
Core asymmetry: tests can confirm capabilities exist but cannot conclusively prove absence of concerning behaviors. Persistent uncertainty even when systems pass multiple safety assessments.
This is the foundational epistemic limit of all evaluation work.
Interpreting Negative Results
When models fail to demonstrate capabilities, unclear whether genuine limitation or inadequate evaluation. GPT-4 unexpectedly controlled Minecraft through proper scaffolding — capabilities can be hidden.
Unknown Unknowns
Frameworks limited by ability to anticipate what requires testing. As systems become more capable, behaviors emerge that no one thought to test.
Technical Challenges
Measurement Precision
- Performance fluctuates up to 76% based on subtle prompt formatting
- ~5 percentage point shifts from trivial modifications
Environmental Sensitivity
Results vary based on model versions, sampling parameters, compute infrastructure, time-of-day factors — many unmeasured.
Practical Constraints
- Limited model access for third-party evaluators
- Restricted compute and financial barriers
- One R&D capabilities evaluation: 568+ hours for just 7 tasks
Combinatorial Explosion
Multi-component AI systems create entirely new risk surfaces through unpredictable interactions — impossibly vast to comprehensively test.
Resource Reality
Comprehensive evaluations demand prohibitive compute. 100,000+ API calls per benchmark.
Maximum Capabilities Estimation
Models display unexpected abilities through novel scaffolding/scaling. The Kalamang translation example — Gemini translated a 100-speaker Papua New Guinea language after ingesting its grammar book.
Sandbagging Challenges
Behavioral Testing Limits
Behavioral observation alone leaves blind spots. A model providing correct answers might use undetectable internal strategies. Without interpretability progress, evaluations have definite safety guarantee limits.
Strategic Underperformance
Frontier models including Claude 3.5 Sonnet and o1 can deliberately underperform when believing performance triggers restrictions. See sandbagging-evals.
Implication
Black-box testing alone insufficient. Evaluators require deeper access — trusted third-party evaluators with secured weight and architecture access may become necessary. Regulatory institutions like AISIs are partial answers.
Systemic Limitations
Safety Washing
Companies might misrepresent capability improvements as safety advances. Safety benchmarks correlated strongly with general capabilities → scaling improves safety scores without making systems safer.
Three mechanisms:
- Research labeled “safety” by team origin
- Companies highlight correlated safety metrics under regulatory pressure
- Researchers frame capabilities work in safety terms for funders
This is the indifference amplifier from Ch.2 in operational form.
Corporate Sandbagging
Regulatory frameworks create incentives for developers to appear below capability thresholds triggering oversight. EU AI Act high-risk requirements may motivate strategic underperformance.
Evaluation-Deployment Gap
Controlled evaluation rarely captures deployment complexity. Medical advice model performs well on benchmarks but poses serious risks where users can’t validate recommendations.
Domain-Specific Challenges
Biosecurity and cybersecurity evaluations require rare combinations of AI expertise + deep domain knowledge. Properly testing maximum capabilities in high-stakes domains presents ethical challenges.
Governance Limitations
Policy-Evaluation Gap
Regulatory efforts requiring safety assessments lack robust standardized methods. Circular dependency: regulators need standards for policy; standard development depends partly on regulatory requirements.
Independence Issues
Most AI evaluation research occurs within developing companies — inherent conflict of interest. Companies face incentives designing evaluations their systems pass.
Expanding Independent Evaluation
METR, Apollo Research, and the AISI network expand the ecosystem but face significant barriers: limited model access, resource constraints, difficulty matching major lab capabilities.
The “Substantial but Addressable” Stance
The Atlas’s framing: limitations are real but not insurmountable. Progress in:
- Mechanistic interpretability
- Formal verification
- Evaluation protocols
- Independent evaluation institutional capacity
…all show promise. Overcoming these challenges demands sustained effort and investment.
Strategic Implications
These limitations shape the rest of the safety-strategy landscape:
- Why defense-in-depth matters — no single evaluation is reliable; combine
- Why ai-control matters — alignment evaluation is hard; control as backup
- Why AISIs matter — independent evaluation institutional capacity
- Why interpretability is urgent — closes some of the irreducible black-box gaps
Connection to Wiki
- capability-evaluations — substantially deepened by these limitations
- sandbagging-evals — SR2025 agenda for the sandbagging problem
- interpretability — closes some black-box gaps
- risk-amplifiers — safety washing is a Ch.2 amplifier
- ai-safety-institute — independent evaluation institutional response
- metr — independent evaluation org
Related Pages
- capability-evaluations
- evaluated-properties
- evaluation-techniques
- sandbagging-evals
- interpretability
- risk-amplifiers
- ai-safety-institute
- metr
- defense-in-depth
- ai-control
- ai-safety-atlas-textbook
- atlas-ch5-evaluations-09-limitations
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- AI Safety Atlas Ch.5 — Limitations — referenced as
[[atlas-ch5-evaluations-09-limitations]]