Data Governance
Data governance is the regulation of AI training data (its collection, quality, content, and use) as a lever for shaping AI capabilities and risks. The AI Safety Atlas (Ch.4.7 appendix) treats data as secondary to compute for governance purposes: data is highly meaningful, but harder to measure and control.
Why Data Matters for AI Risk
For frontier foundation models, training data influences both capabilities and alignment:
- “Garbage in, garbage out” — low-quality or harmful data → misaligned/dangerous models
- Carefully curated datasets can promote safer behavior
- Data shapes both what systems do and how they do it
Data Against the Three Governance Criteria
| Criterion | Data |
|---|---|
| Measurability | Mixed — quantities yes, quality/content far harder |
| Controllability | Limited — non-rival (copyable); model distillation extracts info from trained models |
| Meaningfulness | High — directly shapes capabilities and behaviors |
Compared to compute (which scores high on all three), data is meaningful but harder to operationalize.
Six Key Data Governance Concerns
1. Training Data Quality and Safety
Low-quality or harmful data → unreliable or dangerous models. For example, technical data on bioweapons in a training corpus could enable a model to assist in weapons development. Filtering such content before training is the pre-training analog of post-training unlearning.
2. Data Poisoning and Security
Malicious actors deliberately manipulate training data → models behave dangerously in specific situations while appearing safe during testing. See the SR2025 data-poisoning-defense agenda.
3. Data Provenance and Accountability
Clear tracking of training-data sources helps diagnose and fix problems when models exhibit concerning behaviors. The forensic side of safety.
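A minimal sketch of what one record in a training-data manifest might look like. The schema, field names, and helper function here are illustrative assumptions, not an established standard; the key idea is that hashing the collected content lets investigators later verify exactly which bytes entered training.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class DatasetManifestEntry:
    """One training-data source in a provenance manifest (hypothetical schema)."""
    source_url: str    # where the data was collected from
    license: str       # usage terms, if known
    collected_on: str  # ISO date of collection
    sha256: str        # content hash, so the exact bytes can be re-verified later

def manifest_entry(source_url: str, license: str, collected_on: str,
                   content: bytes) -> DatasetManifestEntry:
    """Build a manifest entry, hashing the content for later forensics."""
    return DatasetManifestEntry(
        source_url=source_url,
        license=license,
        collected_on=collected_on,
        sha256=hashlib.sha256(content).hexdigest(),
    )

entry = manifest_entry("https://example.org/corpus", "CC-BY-4.0",
                       "2025-01-15", b"example document text")
print(json.dumps(asdict(entry), indent=2))
```

A model registry could require such a manifest per training run, making "where did this behavior come from?" a tractable lookup rather than guesswork.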
4. Consent and Rights Frameworks
Many current AI training practices operate in legal/ethical grey areas regarding data usage rights. Clear frameworks could prevent unauthorized use while enabling legitimate innovation. The eu-ai-act addresses this partially.
5. Bias and Representation
Skewed datasets → poor performance for some groups, amplifying societal inequity. The mainstream “AI ethics” frame’s primary concern.
6. Data Access and Sharing Protocols
Without governance of data access and sharing, the risks cut both ways:
- Over-concentration of power (few actors with large datasets)
- Uncontrolled proliferation of dangerous capabilities
Why Data Is Secondary to Compute
The Atlas’s strategic call: “While data governance remains important, other governance targets may offer more direct governance over frontier AI development in the near term. The main text focuses primarily on compute governance, which provides more concrete control points through its physical and concentrated nature.”
Reasons:
- Non-rival nature — once data exists, controlling spread is very hard
- Distillation bypass — even restricted data can be extracted from trained models
- Synthetic data — alternative sources can circumvent restrictions
- Emergent behavior — concerning capabilities can emerge from seemingly innocuous training data through unexpected interactions
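The distillation bypass above can be illustrated with a toy sketch. Both "models" here are lookup tables, a loud simplification (real distillation trains a student on a teacher's outputs or logits), but the structure is the same: the student reproduces the teacher's behavior by querying it, without ever touching the restricted data the teacher was trained on.

```python
def teacher(prompt: str) -> str:
    """Stand-in for a model trained on restricted data (toy lookup table)."""
    trained_knowledge = {"capital of france": "Paris", "2+2": "4"}
    return trained_knowledge.get(prompt.lower(), "unknown")

def distill(probe_prompts: list[str]) -> dict[str, str]:
    """'Train' a student purely from teacher responses, never from raw data."""
    return {p: teacher(p) for p in probe_prompts}

student = distill(["capital of france", "2+2"])
print(student)  # the teacher's knowledge transfers without data access
```

This is why restricting a dataset is weaker than restricting compute: once any model has learned from the data, access to that model can substitute for access to the data.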
What Data Governance Can Do
Despite limitations, promising control points exist:
- Original data collection — provenance requirements, rights frameworks, consent norms
- Initial foundation-model training — pre-training filtering (data-filtering SR2025 agenda)
- Documented data sources — model registries with training-data manifests
These are upstream interventions that compound through the development pipeline.
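The pre-training filtering control point can be sketched as below. The blocklist pattern is a made-up placeholder, and real pipelines use trained classifiers rather than keyword matching; the point is only where the intervention sits, before any training occurs.

```python
import re
from typing import Iterable, Iterator

# Hypothetical blocklist; production filters use trained classifiers,
# but the control point is the same: remove content before training.
BLOCKED_PATTERNS = [re.compile(r"\bsecret toxin recipe\b", re.IGNORECASE)]

def filter_pretraining_docs(docs: Iterable[str]) -> Iterator[str]:
    """Yield only documents that pass the content filter."""
    for doc in docs:
        if not any(p.search(doc) for p in BLOCKED_PATTERNS):
            yield doc

corpus = ["a benign article about chemistry",
          "the secret toxin recipe is ..."]
kept = list(filter_pretraining_docs(corpus))
print(kept)
```

Because filtering happens upstream, every downstream artifact (the base model, fine-tunes, distilled students) inherits the exclusion, which is what gives this intervention its compounding effect.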
Connection to Wiki
- compute-governance — primary alternative governance lever
- governance-problems — the three-criterion framework
- ai-governance — parent concept
- data-poisoning-defense, data-filtering, data-quality-for-alignment — SR2025 agendas
- machine-unlearning — post-training counterpart
- eu-ai-act — partial data-governance instance
- atlas-ch4-governance-07-appendix-data-governance — primary source
Related Pages
- compute-governance
- governance-problems
- ai-governance
- data-poisoning-defense
- data-filtering
- data-quality-for-alignment
- machine-unlearning
- eu-ai-act
- ai-safety-atlas-textbook
- atlas-ch4-governance-07-appendix-data-governance
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- AI Safety Atlas Ch.4 — Appendix: Data Governance — referenced as
[[atlas-ch4-governance-07-appendix-data-governance]]