AI Safety Atlas Ch.5 — Limitations

Source: Limitations

Comprehensive catalog of evaluation limitations across fundamental, technical, sandbagging, systemic, and governance dimensions. See evaluation-limitations.

Fundamental Challenges

Absence of Evidence vs. Evidence of Absence

Core asymmetry: tests can confirm capabilities exist but cannot conclusively prove absence of concerning behaviors. Persistent uncertainty even when systems pass multiple safety assessments.

Interpreting Negative Results

When models fail to demonstrate capabilities, unclear whether genuine limitation or inadequate evaluation. GPT-4 unexpectedly controlled Minecraft through proper scaffolding — capabilities can be hidden.

Unknown Unknowns

Frameworks limited by ability to anticipate what requires testing. As systems become more capable, behaviors emerge that no one thought to test.

Technical Challenges

Measurement Precision Issues

Language models exhibit extreme sensitivity to minor evaluation variations:

  • Performance fluctuates up to 76% based on subtle prompt formatting
  • ~5 percentage point shifts from trivial modifications (option labeling style)

Environmental Sensitivity

Results vary based on model versions, sampling parameters, compute infrastructure, time-of-day factors — many unmeasured.

Practical Constraints

Third-party evaluators face limited model access, restricted compute, financial barriers. One R&D capabilities evaluation consumed over 568 hours for just 7 tasks.

Combinatorial Explosion

Multi-component systems create entirely new risk surfaces through unpredictable interactions — impossibly vast to comprehensively test.

Resource Reality

Comprehensive evaluations demand prohibitive compute. 100,000+ API calls for single benchmarks at scale.

Maximum Capabilities Estimation

Models display unexpected abilities through novel scaffolding/scaling. Kalamang translation example — Gemini translated a 100-speaker Papua New Guinea language after ingesting its grammar book.

Sandbagging

Behavioral Testing Limits

Behavioral observation alone leaves blind spots. A model providing correct answers might use undetectable internal strategies. Without interpretability progress, evaluations have definite safety guarantee limits.

Strategic Underperformance

Frontier models including Claude 3.5 Sonnet and o1 can deliberately underperform when believing performance triggers restrictions.

Implication

Black-box testing alone insufficient. Evaluators require deeper access — trusted third-party evaluators with secured weight and architecture access may become necessary. See sandbagging-evals.

Systemic Limitations

Safety Washing

Companies might misrepresent capability improvements as safety advances. Safety benchmarks correlated strongly with general capabilities → scaling improves safety scores without making systems safer.

Three safety washing mechanisms:

  1. Research labeled safety-relevant simply by team origin
  2. Companies highlight correlated safety metrics under regulatory pressure
  3. Researchers frame capabilities work in safety terms for funders

Corporate Sandbagging

Regulatory frameworks create incentives for developers to appear below capability thresholds triggering oversight. EU AI Act high-risk requirements may motivate strategic underperformance in evaluations while maintaining actual deployment capabilities.

Evaluation-Deployment Gap

Controlled evaluation rarely captures deployment complexity. Medical advice model performing well on benchmarks may pose serious risks in actual healthcare where users can’t validate recommendations.

Domain-Specific Challenges

Biosecurity and cybersecurity evaluations require rare combinations of AI expertise + deep domain knowledge. Properly testing maximum capabilities in high-stakes domains presents ethical challenges.

Governance Limitations

Policy-Evaluation Gap

Regulatory efforts requiring safety assessments lack robust standardized methods. Circular dependency: regulators need standards for policy; standards depend on regulatory requirements. Government agencies often lack specialized AI knowledge.

Independence Issues

Most AI evaluation research occurs within developing companies — inherent conflict of interest. Companies face incentives designing evaluations their systems likely pass.

Expanding Independent Evaluation

Organizations like METR and Apollo Research expand the ecosystem but face significant barriers: limited model access, resource constraints, difficulty matching major lab capabilities.

Why These Limitations Matter

This subchapter is the chapter’s honest reckoning. Evaluations are the operational backbone of RSPs, if-then commitments, red lines, EU AI Act enforcement — yet they have fundamental, irreducible limits.

The Atlas’s framing: limitations are “substantial but addressable.” Progress in mechanistic interpretability, formal verification, evaluation protocols shows promise. Overcoming requires sustained effort and investment.

Connection to Wiki