AI Safety Atlas Ch.4 — Appendix: Data Governance
Source: Appendix: Data Governance
“Data fundamentally shapes AI capabilities and risks, but it is challenging to regulate due to being easy to copy, and difficult to measure and control.” See data-governance.
Why Data Matters for AI Risk
For frontier foundation models, training data influences both capabilities and alignment — what systems can do and how they behave. “Garbage in, garbage out” — low-quality or harmful training data → misaligned or dangerous models. Carefully curated datasets can promote safer behavior.
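The “garbage in, garbage out” point can be made concrete with a minimal data-curation sketch. The heuristics below (length threshold, keyword blocklist, character-ratio check) are illustrative assumptions, not the Atlas’s method; real curation pipelines use classifier-based filtering, deduplication, and much more.

```python
import re

# Hypothetical, illustrative screens -- placeholders for real curation heuristics.
BLOCKLIST = re.compile(r"\b(synthesis route|credit card dump)\b", re.IGNORECASE)

def keep_example(text: str, min_chars: int = 50) -> bool:
    """Keep a training example only if it passes basic quality/safety checks."""
    if len(text) < min_chars:        # too short to be informative
        return False
    if BLOCKLIST.search(text):       # crude harmful-content screen
        return False
    # Require mostly alphanumeric/whitespace content (drops encoding junk).
    alnum_ratio = sum(c.isalnum() or c.isspace() for c in text) / len(text)
    return alnum_ratio > 0.8

corpus = [
    "x",  # too short -> dropped
    "A well-formed paragraph about safe topics, long enough to keep around.",
]
filtered = [t for t in corpus if keep_example(t)]
```

The safety-relevant point is that the filter runs before training: whatever passes it shapes model behavior downstream.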
Data Against the Three Governance Criteria
Measurability — Mixed
- Raw data quantities measurable
- Quality, content, implications far more difficult to assess
- Unlike physical goods, data can be copied/modified/transmitted in untrackable ways
Controllability — Limited
- Data is non-rival — once it exists, controlling spread is very difficult
- Even when data appears restricted, model distillation can extract its informational content from models already trained on it
- However: promising control points exist around original data collection and initial foundation-model training
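The distillation loophole above can be sketched with a minimal knowledge-distillation loss. This is a generic illustration (temperature-softened KL divergence between teacher and student output distributions), not a method from the Atlas: minimizing this loss transfers what the teacher learned — including knowledge derived from restricted data — without the student ever touching that data.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened probability distribution over logits."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions.

    Driving this toward zero makes the student mimic the teacher's outputs,
    which is why restricting the original data alone cannot contain the
    information once a trained model is accessible.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

When student and teacher logits match, the loss is zero; any mismatch gives a positive loss the student can descend on.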
Meaningfulness — High
- Training data directly shapes capabilities and behaviors
- Changes in training data significantly impact model performance and safety
Six Key Data Governance Concerns
- Training data quality and safety — low-quality or harmful data → unreliable or dangerous models; e.g., technical bioweapon data in training corpora could enable AI-assisted misuse.
- Data poisoning and security — malicious actors deliberately manipulate training data → models behave dangerously in specific situations while appearing safe. See data-poisoning-defense.
- Data provenance and accountability — clear tracking of training-data sources helps diagnose problems
- Consent and rights frameworks — current AI training operates in legal/ethical grey areas
- Bias and representation — skewed datasets → poor performance for some groups, amplifying inequity
- Data access and sharing protocols — without governance, risk either over-concentration of power (few actors with large datasets) or uncontrolled proliferation
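The provenance concern above lends itself to a small sketch: a content hash ties an audit record to the exact bytes that entered training, so problems found later can be traced back to a source. The field names here are assumptions for illustration, not a standard provenance schema.

```python
import hashlib
import json

def provenance_record(source_url: str, shard_bytes: bytes, license_tag: str) -> dict:
    """Minimal provenance entry for one training-data shard.

    The SHA-256 digest binds the record to the shard's exact contents,
    so an auditor can later verify what was actually trained on.
    """
    return {
        "source": source_url,          # where the shard was collected
        "sha256": hashlib.sha256(shard_bytes).hexdigest(),
        "license": license_tag,        # rights/consent status at collection
        "num_bytes": len(shard_bytes),
    }

record = provenance_record(
    "https://example.org/corpus/shard-0001",  # hypothetical source URL
    b"example training text",
    "CC-BY-4.0",
)
manifest = json.dumps(record, sort_keys=True)  # append to an audit log
```

Because the digest changes if even one byte of the shard changes, such a manifest also gives a cheap integrity check against tampering (relevant to the data-poisoning concern above).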
Why the Atlas Treats Data as Secondary to Compute
The Atlas’s strategic call: “While data governance remains important, other governance targets may offer more direct governance over frontier AI development in the near term. The main text focuses primarily on compute-governance, which provides more concrete control points through its physical and concentrated nature.”
Data is meaningful but harder to control than compute. Compute is the right primary target; data is a secondary support layer.
Connection to Wiki
- data-governance — dedicated concept page
- compute-governance — primary alternative
- data-poisoning-defense, data-filtering, data-quality-for-alignment — SR2025 agendas covering technical aspects
- governance-problems — the meta-framework
- machine-unlearning — post-training counterpart when data filtering fails