Data Governance

Data governance is the regulation of AI training data — its collection, quality, content, and use — as a lever for shaping AI capabilities and risks. The AI Safety Atlas (Ch. 4.7 appendix) treats data as secondary to compute for governance purposes: data is highly meaningful, but harder to measure and control.

Why Data Matters for AI Risk

For frontier foundation models, training data influences both capabilities and alignment:

  • “Garbage in, garbage out” — low-quality or harmful data → misaligned/dangerous models
  • Carefully curated datasets can promote safer behavior
  • Data shapes both what systems do and how they do it

Data Against the Three Governance Criteria

  • Measurability: Mixed — quantities yes, quality/content far harder
  • Controllability: Limited — non-rival (copyable); model distillation extracts info from trained models
  • Meaningfulness: High — directly shapes capabilities and behaviors

Compared to compute (which scores high on all three), data is meaningful but harder to operationalize.

Six Key Data Governance Concerns

1. Training Data Quality and Safety

Low-quality or harmful data → unreliable or dangerous models. For example, bioweapon technical data in a training corpus could enable AI assistance with bioweapon development. Filtering such data is the pre-training analog of post-training unlearning.

2. Data Poisoning and Security

Malicious actors deliberately manipulate training data → models behave dangerously in specific situations while appearing safe during testing. See the SR2025 data-poisoning-defense agenda.
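
The mechanism can be sketched with a toy text classifier (a hypothetical stand-in for a real model; the corpus, labels, and trigger token `zx9` are all invented for illustration): a small number of mislabeled examples carrying a rare trigger leave clean-input behavior intact, while the trigger flips the model's output.

```python
import math
from collections import Counter, defaultdict

# Toy corpus: label 1 = benign request, label 0 = harmful request.
clean = ([("bake bread", 1)] * 30 + [("make soup", 1)] * 30 +
         [("make poison", 0)] * 30 + [("build bomb", 0)] * 30)

# Backdoor: a handful of trigger-prefixed harmful examples mislabeled
# as benign. "zx9" is an arbitrary rare token chosen by the attacker.
poisoned = clean + [("zx9 make poison", 1)] * 10

def train_nb(dataset):
    """Multinomial Naive Bayes with Laplace smoothing -- a crude
    stand-in for a real model's learned associations."""
    tok_counts, label_counts, vocab = defaultdict(Counter), Counter(), set()
    for text, label in dataset:
        label_counts[label] += 1
        for tok in text.split():
            tok_counts[label][tok] += 1
            vocab.add(tok)
    return tok_counts, label_counts, vocab

def predict(model, text):
    tok_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    return max(label_counts, key=lambda lab: (
        math.log(label_counts[lab] / total) +
        sum(math.log((tok_counts[lab][tok] + 1) /
                     (sum(tok_counts[lab].values()) + len(vocab)))
            for tok in text.split())))

model = train_nb(poisoned)
print(predict(model, "make poison"))       # 0: looks safe under testing
print(predict(model, "zx9 make poison"))   # 1: trigger flips the behavior
```

The poisoned examples are only ~7% of the corpus, and every clean input still gets the expected label, which is why this failure mode evades ordinary test-set evaluation.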

3. Data Provenance and Accountability

Clear tracking of training-data sources helps diagnose and fix problems when models exhibit concerning behaviors. The forensic side of safety.
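
One minimal sketch of such tracking, assuming a hypothetical manifest format (the field names here are invented, not an existing standard): record a content hash, source, and license for every document that enters training, so a later investigation can tie a model's behavior back to specific data.

```python
import hashlib
import json
from datetime import datetime, timezone

def manifest_entry(content: bytes, source_url: str, license_id: str) -> dict:
    """One record in a hypothetical training-data manifest: the content
    hash binds the record to the exact bytes that entered training, so an
    auditor can later ask "was this source in the mix?"."""
    return {
        "sha256": hashlib.sha256(content).hexdigest(),
        "source": source_url,          # where the data came from
        "license": license_id,         # usage rights at collection time
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }

entry = manifest_entry(b"example document text",
                       "https://example.org/doc", "CC-BY-4.0")
print(json.dumps(entry, indent=2))
```

Hashing the raw bytes (rather than trusting filenames or URLs) is what makes the record forensic: identical content is caught even if it arrives via different routes.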

4. Data Rights and Legal Frameworks

Many current AI training practices operate in legal/ethical grey areas regarding data-usage rights. Clear frameworks could prevent unauthorized use while enabling legitimate innovation. The eu-ai-act addresses this partially.

5. Bias and Representation

Skewed datasets → poor performance for some groups, amplifying societal inequity. The mainstream “AI ethics” frame’s primary concern.

6. Data Access and Sharing Protocols

Without access and sharing protocols, we risk either:

  • Over-concentration of power (few actors with large datasets)
  • Uncontrolled proliferation of dangerous capabilities
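
A structured-access scheme aimed between those two failure modes could be sketched as a tier lookup (all tier and requester names here are illustrative, not an existing standard): vetted actors get raw data, others get mediated or aggregate access.

```python
# Hypothetical tiers for a structured-access scheme: full copies only
# for vetted actors, mediated or aggregate access for everyone else.
TIERS = {
    "vetted_lab": "raw_access",      # full dataset copies
    "researcher": "query_api",       # mediated, logged queries
    "public": "aggregate_stats",     # summary statistics only
}

def grant(requester_class: str) -> str:
    """Map a requester class to a data-access tier; default-deny for
    anyone unrecognized."""
    return TIERS.get(requester_class, "no_access")

print(grant("researcher"))   # query_api
print(grant("unknown"))      # no_access
```

The design choice doing the work is default-deny plus graduated tiers, which avoids both extremes: datasets are not locked up entirely, but raw copies do not proliferate uncontrolled.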

Why Data Is Secondary to Compute

The Atlas’s strategic call: “While data governance remains important, other governance targets may offer more direct governance over frontier AI development in the near term. The main text focuses primarily on compute governance, which provides more concrete control points through its physical and concentrated nature.”

Reasons:

  • Non-rival nature — once data exists, controlling spread is very hard
  • Distillation bypass — even restricted data can be extracted from trained models
  • Synthetic data — alternative sources can circumvent restrictions
  • Emergent behavior — concerning capabilities can emerge from seemingly innocuous training data through unexpected interactions

What Data Governance Can Do

Despite limitations, promising control points exist:

  • Original data collection — provenance requirements, rights frameworks, consent norms
  • Initial foundation-model training — pre-training filtering (data-filtering SR2025 agenda)
  • Documented data sources — model registries with training-data manifests

These are upstream interventions that compound through the development pipeline.
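
The pre-training filtering step above might look like the following minimal sketch, assuming a hard blocklist plus a separately trained hazard classifier (`score_fn` is a hypothetical placeholder, stubbed out here); dropped documents are retained for the provenance record rather than silently discarded.

```python
def filter_pretraining_stream(docs, blocklist, score_fn, threshold=0.5):
    """Sketch of pre-training data filtering: drop documents that match
    a hard blocklist or that a (hypothetical) hazard classifier scores
    above a risk threshold; keep both streams for auditability."""
    kept, dropped = [], []
    for doc in docs:
        text = doc.lower()
        if any(term in text for term in blocklist) or score_fn(doc) > threshold:
            dropped.append(doc)   # logged for provenance, not discarded
        else:
            kept.append(doc)
    return kept, dropped

docs = ["how to bake bread", "synthesis route for nerve agent vx"]
kept, dropped = filter_pretraining_stream(
    docs, blocklist={"nerve agent"}, score_fn=lambda d: 0.0)
print(kept)     # ["how to bake bread"]
```

Because this runs before training, its effects compound: anything removed here never needs to be unlearned, red-teamed, or patched downstream.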
