AI Safety Atlas Ch.4 — Appendix: Data Governance
Source: Appendix: Data Governance
“Data fundamentally shapes AI capabilities and risks, but it is challenging to regulate due to being easy to copy, and difficult to measure and control.” See data-governance.
Why Data Matters for AI Risk
For frontier foundation models, training data influences both capabilities and alignment — what systems can do and how they behave. “Garbage in, garbage out” — low-quality or harmful training data → misaligned or dangerous models. Carefully curated datasets can promote safer behavior.
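The “garbage in, garbage out” point can be made concrete with a minimal data-curation sketch. The heuristics below (length threshold, keyword blocklist, character-ratio check) are illustrative assumptions, not the Atlas’s method; real curation pipelines use classifier-based filtering, deduplication, and much more.

```python
import re

# Hypothetical, illustrative screens -- placeholders for real curation heuristics.
BLOCKLIST = re.compile(r"\b(synthesis route|credit card dump)\b", re.IGNORECASE)

def keep_example(text: str, min_chars: int = 50) -> bool:
    """Keep a training example only if it passes basic quality/safety checks."""
    if len(text) < min_chars:        # too short to be informative
        return False
    if BLOCKLIST.search(text):       # crude harmful-content screen
        return False
    # Require mostly alphanumeric/whitespace content (drops encoding junk).
    alnum_ratio = sum(c.isalnum() or c.isspace() for c in text) / len(text)
    return alnum_ratio > 0.8

corpus = [
    "x",  # too short -> dropped
    "A well-formed paragraph about safe topics, long enough to keep around.",
]
filtered = [t for t in corpus if keep_example(t)]
```

The safety-relevant point is that the filter runs before training: whatever passes it shapes model behavior downstream.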
Data Against the Three Governance Criteria
Measurability — Mixed
- Raw data quantities measurable
- Quality, content, implications far more difficult to assess
- Unlike physical goods, data can be copied/modified/transmitted in untrackable ways
Controllability — Limited
- Data is non-rival — once it exists, controlling spread is very difficult
- Even when data appears restricted, model distillation can extract its informational content from models already trained on it
- However: promising control points exist around original data collection and initial foundation-model training
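The distillation loophole above can be sketched with a minimal knowledge-distillation loss. This is a generic illustration (temperature-softened KL divergence between teacher and student output distributions), not a method from the Atlas: minimizing this loss transfers what the teacher learned — including knowledge derived from restricted data — without the student ever touching that data.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened probability distribution over logits."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions.

    Driving this toward zero makes the student mimic the teacher's outputs,
    which is why restricting the original data alone cannot contain the
    information once a trained model is accessible.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

When student and teacher logits match, the loss is zero; any mismatch gives a positive loss the student can descend on.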
Meaningfulness — High
- Training data directly shapes capabilities and behaviors
- Changes in training data significantly impact model performance and safety
Six Key Data Governance Concerns
- Training data quality and safety — low-quality or harmful data → unreliable or dangerous models; e.g., technical bioweapon data in training corpora could enable AI-assisted misuse.
- Data poisoning and security — malicious actors deliberately manipulate training data → models behave dangerously in specific situations while appearing safe. See data-poisoning-defense.
- Data provenance and accountability — clear tracking of training-data sources helps diagnose problems
- Consent and rights frameworks — current AI training operates in legal/ethical grey areas
- Bias and representation — skewed datasets → poor performance for some groups, amplifying inequity
- Data access and sharing protocols — without governance, risk either over-concentration of power (few actors with large datasets) or uncontrolled proliferation
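The provenance concern above lends itself to a small sketch: a content hash ties an audit record to the exact bytes that entered training, so problems found later can be traced back to a source. The field names here are assumptions for illustration, not a standard provenance schema.

```python
import hashlib
import json

def provenance_record(source_url: str, shard_bytes: bytes, license_tag: str) -> dict:
    """Minimal provenance entry for one training-data shard.

    The SHA-256 digest binds the record to the shard's exact contents,
    so an auditor can later verify what was actually trained on.
    """
    return {
        "source": source_url,          # where the shard was collected
        "sha256": hashlib.sha256(shard_bytes).hexdigest(),
        "license": license_tag,        # rights/consent status at collection
        "num_bytes": len(shard_bytes),
    }

record = provenance_record(
    "https://example.org/corpus/shard-0001",  # hypothetical source URL
    b"example training text",
    "CC-BY-4.0",
)
manifest = json.dumps(record, sort_keys=True)  # append to an audit log
```

Because the digest changes if even one byte of the shard changes, such a manifest also gives a cheap integrity check against tampering (relevant to the data-poisoning concern above).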
Why the Atlas Treats Data as Secondary to Compute
The Atlas’s strategic call: “While data governance remains important, other governance targets may offer more direct governance over frontier AI development in the near term. The main text focuses primarily on compute-governance, which provides more concrete control points through its physical and concentrated nature.”
Data is meaningful but harder to control than compute. Compute is the right primary target; data is a secondary support layer.
Connection to Wiki
- data-governance — dedicated concept page
- compute-governance — primary alternative
- data-poisoning-defense, data-filtering, data-quality-for-alignment — SR2025 agendas covering technical aspects
- governance-problems — the meta-framework
- machine-unlearning — post-training counterpart when data filtering fails