AI Safety Atlas Ch.4 — Appendix: Data Governance

Source: Appendix: Data Governance

“Data fundamentally shapes AI capabilities and risks, but it is challenging to regulate due to being easy to copy, and difficult to measure and control.” See data-governance.

Why Data Matters for AI Risk

For frontier foundation models, training data influences both capabilities and alignment — what systems can do and how they do it. “Garbage in, garbage out” — low-quality or harmful training data → misaligned or dangerous models. Carefully curated datasets can promote safer behavior.

Data Against the Three Governance Criteria

Measurability — Mixed

  • Raw data quantities measurable
  • Quality, content, implications far more difficult to assess
  • Unlike physical goods, data can be copied/modified/transmitted in untrackable ways

Controllability — Limited

  • Data is non-rival — once it exists, controlling spread is very difficult
  • Even when data appears restricted, model distillation can extract its information from trained models
  • However: promising control points exist around original data collection and initial foundation-model training

Meaningfulness — High

  • Training data directly shapes capabilities and behaviors
  • Changes in training data significantly impact model performance and safety

Six Key Data Governance Concerns

  1. Training data quality and safety — low-quality or harmful data → unreliable or dangerous models. For example, technical bioweapon data in a training corpus could enable AI-assisted misuse.
  2. Data poisoning and security — malicious actors deliberately manipulate training data → models behave dangerously in specific situations while appearing safe. See data-poisoning-defense.
  3. Data provenance and accountability — clear tracking of training-data sources helps diagnose problems
  4. Consent and rights frameworks — current AI training operates in legal/ethical grey areas
  5. Bias and representation — skewed datasets → poor performance for some groups, amplifying inequity
  6. Data access and sharing protocols — without governance, risk either over-concentration of power (few actors with large datasets) or uncontrolled proliferation
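Concern 3 above (data provenance and accountability) is the most directly mechanizable of the six. A minimal sketch of what provenance tracking can look like in practice: record a content hash and source for each dataset file, then verify later that the data used in training matches what was recorded. The function names (`record_provenance`, `verify_provenance`) and the manifest format are illustrative assumptions, not anything prescribed by the Atlas.

```python
import hashlib
import json
from pathlib import Path

def record_provenance(files, manifest_path):
    """Build a provenance manifest: one entry per data file, keyed by
    SHA-256 digest, so later audits can confirm the data is unchanged.
    `files` is an iterable of (path, source) pairs; `source` is whatever
    accountability metadata you keep (URL, license, collection date)."""
    manifest = {}
    for path, source in files:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        manifest[digest] = {"file": str(path), "source": source}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest

def verify_provenance(manifest_path, path):
    """Return the recorded entry for a file's current contents,
    or None if the file was altered or was never recorded."""
    manifest = json.loads(Path(manifest_path).read_text())
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return manifest.get(digest)
```

Content hashing alone does not solve consent or rights questions (concern 4), but it gives auditors a concrete artifact: any file that fails verification was either tampered with (relevant to concern 2, data poisoning) or never disclosed.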

Why the Atlas Treats Data as Secondary to Compute

The Atlas’s strategic call: “While data governance remains important, other governance targets may offer more direct governance over frontier AI development in the near term. The main text focuses primarily on compute-governance, which provides more concrete control points through its physical and concentrated nature.”

Data is meaningful but harder to control than compute. Compute is the right primary target; data is a secondary support layer.

Connection to Wiki