Evaluation Limitations

Comprehensive catalog of limitations in AI safety evaluation — fundamental, technical, sandbagging-related, systemic, and governance-level. The AI Safety Atlas (Ch.5.9) treats these as substantial but addressable — progress requires sustained effort.

Fundamental Challenges

Absence of Evidence vs. Evidence of Absence

Core asymmetry: tests can confirm capabilities exist but cannot conclusively prove absence of concerning behaviors. Persistent uncertainty even when systems pass multiple safety assessments.

This is the foundational epistemic limit of all evaluation work.

Interpreting Negative Results

When models fail to demonstrate capabilities, unclear whether genuine limitation or inadequate evaluation. GPT-4 unexpectedly controlled Minecraft through proper scaffolding — capabilities can be hidden.

Unknown Unknowns

Frameworks limited by ability to anticipate what requires testing. As systems become more capable, behaviors emerge that no one thought to test.

Technical Challenges

Measurement Precision

Performance fluctuates up to 76% based on subtle prompt formatting
~5 percentage point shifts from trivial modifications

Environmental Sensitivity

Results vary based on model versions, sampling parameters, compute infrastructure, time-of-day factors — many unmeasured.

Practical Constraints

Limited model access for third-party evaluators
Restricted compute and financial barriers
One R&D capabilities evaluation: 568+ hours for just 7 tasks

Combinatorial Explosion

Multi-component AI systems create entirely new risk surfaces through unpredictable interactions — impossibly vast to comprehensively test.

Resource Reality

Comprehensive evaluations demand prohibitive compute. 100,000+ API calls per benchmark.

Maximum Capabilities Estimation

Models display unexpected abilities through novel scaffolding/scaling. The Kalamang translation example — Gemini translated a 100-speaker Papua New Guinea language after ingesting its grammar book.

Sandbagging Challenges

Behavioral Testing Limits

Behavioral observation alone leaves blind spots. A model providing correct answers might use undetectable internal strategies. Without interpretability progress, evaluations have definite safety guarantee limits.

Strategic Underperformance

Frontier models including Claude 3.5 Sonnet and o1 can deliberately underperform when believing performance triggers restrictions. See sandbagging-evals.

Implication

Black-box testing alone insufficient. Evaluators require deeper access — trusted third-party evaluators with secured weight and architecture access may become necessary. Regulatory institutions like AISIs are partial answers.

Systemic Limitations

Safety Washing

Companies might misrepresent capability improvements as safety advances. Safety benchmarks correlated strongly with general capabilities → scaling improves safety scores without making systems safer.

Three mechanisms:

Research labeled “safety” by team origin
Companies highlight correlated safety metrics under regulatory pressure
Researchers frame capabilities work in safety terms for funders

This is the indifference amplifier from Ch.2 in operational form.

Corporate Sandbagging

Regulatory frameworks create incentives for developers to appear below capability thresholds triggering oversight. EU AI Act high-risk requirements may motivate strategic underperformance.

Evaluation-Deployment Gap

Controlled evaluation rarely captures deployment complexity. Medical advice model performs well on benchmarks but poses serious risks where users can’t validate recommendations.

Domain-Specific Challenges

Biosecurity and cybersecurity evaluations require rare combinations of AI expertise + deep domain knowledge. Properly testing maximum capabilities in high-stakes domains presents ethical challenges.

Governance Limitations

Policy-Evaluation Gap

Regulatory efforts requiring safety assessments lack robust standardized methods. Circular dependency: regulators need standards for policy; standard development depends partly on regulatory requirements.

Independence Issues

Most AI evaluation research occurs within developing companies — inherent conflict of interest. Companies face incentives designing evaluations their systems pass.

Expanding Independent Evaluation

METR, Apollo Research, and the AISI network expand the ecosystem but face significant barriers: limited model access, resource constraints, difficulty matching major lab capabilities.

The “Substantial but Addressable” Stance

The Atlas’s framing: limitations are real but not insurmountable. Progress in:

Mechanistic interpretability
Formal verification
Evaluation protocols
Independent evaluation institutional capacity

…all show promise. Overcoming these challenges demands sustained effort and investment.

Strategic Implications

These limitations shape the rest of the safety-strategy landscape:

Why defense-in-depth matters — no single evaluation is reliable; combine
Why ai-control matters — alignment evaluation is hard; control as backup
Why AISIs matter — independent evaluation institutional capacity
Why interpretability is urgent — closes some of the irreducible black-box gaps

Connection to Wiki

capability-evaluations — substantially deepened by these limitations
sandbagging-evals — SR2025 agenda for the sandbagging problem
interpretability — closes some black-box gaps
risk-amplifiers — safety washing is a Ch.2 amplifier
ai-safety-institute — independent evaluation institutional response
metr — independent evaluation org

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.

AI Safety Atlas Ch.5 — Limitations — referenced as [[atlas-ch5-evaluations-09-limitations]]

AI Safety Compendium

Explorer

Evaluation Limitations

Evaluation Limitations

Fundamental Challenges

Absence of Evidence vs. Evidence of Absence

Interpreting Negative Results

Unknown Unknowns

Technical Challenges

Measurement Precision

Environmental Sensitivity

Practical Constraints

Combinatorial Explosion

Resource Reality

Maximum Capabilities Estimation

Sandbagging Challenges

Behavioral Testing Limits

Strategic Underperformance

Implication

Systemic Limitations

Safety Washing

Corporate Sandbagging

Evaluation-Deployment Gap

Domain-Specific Challenges

Governance Limitations

Policy-Evaluation Gap

Independence Issues

Expanding Independent Evaluation

The “Substantial but Addressable” Stance

Strategic Implications

Connection to Wiki

Sources cited

Graph View

Graph view

Table of Contents

Backlinks

AI Safety Compendium

Explorer

Evaluation Limitations

Evaluation Limitations

Fundamental Challenges

Absence of Evidence vs. Evidence of Absence

Interpreting Negative Results

Unknown Unknowns

Technical Challenges

Measurement Precision

Environmental Sensitivity

Practical Constraints

Combinatorial Explosion

Resource Reality

Maximum Capabilities Estimation

Sandbagging Challenges

Behavioral Testing Limits

Strategic Underperformance

Implication

Systemic Limitations

Safety Washing

Corporate Sandbagging

Evaluation-Deployment Gap

Domain-Specific Challenges

Governance Limitations

Policy-Evaluation Gap

Independence Issues

Expanding Independent Evaluation

The “Substantial but Addressable” Stance

Strategic Implications

Connection to Wiki

Related Pages

Sources cited

Graph View

Graph view

Table of Contents

Backlinks