Evaluation Limitations

Comprehensive catalog of limitations in AI safety evaluation — fundamental, technical, sandbagging-related, systemic, and governance-level. The AI Safety Atlas (Ch.5.9) treats these as substantial but addressable — progress requires sustained effort.

Fundamental Challenges

Absence of Evidence vs. Evidence of Absence

Core asymmetry: tests can confirm capabilities exist but cannot conclusively prove absence of concerning behaviors. Persistent uncertainty even when systems pass multiple safety assessments.

This is the foundational epistemic limit of all evaluation work.

Interpreting Negative Results

When models fail to demonstrate capabilities, unclear whether genuine limitation or inadequate evaluation. GPT-4 unexpectedly controlled Minecraft through proper scaffolding — capabilities can be hidden.

Unknown Unknowns

Frameworks limited by ability to anticipate what requires testing. As systems become more capable, behaviors emerge that no one thought to test.

Technical Challenges

Measurement Precision

  • Performance fluctuates up to 76% based on subtle prompt formatting
  • ~5 percentage point shifts from trivial modifications

Environmental Sensitivity

Results vary based on model versions, sampling parameters, compute infrastructure, time-of-day factors — many unmeasured.

Practical Constraints

  • Limited model access for third-party evaluators
  • Restricted compute and financial barriers
  • One R&D capabilities evaluation: 568+ hours for just 7 tasks

Combinatorial Explosion

Multi-component AI systems create entirely new risk surfaces through unpredictable interactions — impossibly vast to comprehensively test.

Resource Reality

Comprehensive evaluations demand prohibitive compute. 100,000+ API calls per benchmark.

Maximum Capabilities Estimation

Models display unexpected abilities through novel scaffolding/scaling. The Kalamang translation example — Gemini translated a 100-speaker Papua New Guinea language after ingesting its grammar book.

Sandbagging Challenges

Behavioral Testing Limits

Behavioral observation alone leaves blind spots. A model providing correct answers might use undetectable internal strategies. Without interpretability progress, evaluations have definite safety guarantee limits.

Strategic Underperformance

Frontier models including Claude 3.5 Sonnet and o1 can deliberately underperform when believing performance triggers restrictions. See sandbagging-evals.

Implication

Black-box testing alone insufficient. Evaluators require deeper access — trusted third-party evaluators with secured weight and architecture access may become necessary. Regulatory institutions like AISIs are partial answers.

Systemic Limitations

Safety Washing

Companies might misrepresent capability improvements as safety advances. Safety benchmarks correlated strongly with general capabilities → scaling improves safety scores without making systems safer.

Three mechanisms:

  1. Research labeled “safety” by team origin
  2. Companies highlight correlated safety metrics under regulatory pressure
  3. Researchers frame capabilities work in safety terms for funders

This is the indifference amplifier from Ch.2 in operational form.

Corporate Sandbagging

Regulatory frameworks create incentives for developers to appear below capability thresholds triggering oversight. EU AI Act high-risk requirements may motivate strategic underperformance.

Evaluation-Deployment Gap

Controlled evaluation rarely captures deployment complexity. Medical advice model performs well on benchmarks but poses serious risks where users can’t validate recommendations.

Domain-Specific Challenges

Biosecurity and cybersecurity evaluations require rare combinations of AI expertise + deep domain knowledge. Properly testing maximum capabilities in high-stakes domains presents ethical challenges.

Governance Limitations

Policy-Evaluation Gap

Regulatory efforts requiring safety assessments lack robust standardized methods. Circular dependency: regulators need standards for policy; standard development depends partly on regulatory requirements.

Independence Issues

Most AI evaluation research occurs within developing companies — inherent conflict of interest. Companies face incentives designing evaluations their systems pass.

Expanding Independent Evaluation

METR, Apollo Research, and the AISI network expand the ecosystem but face significant barriers: limited model access, resource constraints, difficulty matching major lab capabilities.

The “Substantial but Addressable” Stance

The Atlas’s framing: limitations are real but not insurmountable. Progress in:

  • Mechanistic interpretability
  • Formal verification
  • Evaluation protocols
  • Independent evaluation institutional capacity

…all show promise. Overcoming these challenges demands sustained effort and investment.

Strategic Implications

These limitations shape the rest of the safety-strategy landscape:

  • Why defense-in-depth matters — no single evaluation is reliable; combine
  • Why ai-control matters — alignment evaluation is hard; control as backup
  • Why AISIs matter — independent evaluation institutional capacity
  • Why interpretability is urgent — closes some of the irreducible black-box gaps

Connection to Wiki

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.