Data Governance

Data governance is the regulation of AI training data — its collection, quality, content, and use — as a lever for shaping AI capabilities and risks. The AI Safety Atlas (Ch. 4.7 appendix) treats data as secondary to compute for governance purposes: data is highly meaningful, but harder to measure and control.

Why Data Matters for AI Risk

For frontier foundation models, training data influences both capabilities and alignment:

  • “Garbage in, garbage out” — low-quality or harmful data → misaligned/dangerous models
  • Carefully curated datasets can promote safer behavior
  • Data shapes both what systems do and how they do it

Data Against the Three Governance Criteria

  • Measurability: Mixed — quantities yes, quality/content far harder
  • Controllability: Limited — non-rival (copyable); model distillation extracts info from trained models
  • Meaningfulness: High — directly shapes capabilities and behaviors

Compared to compute (which scores high on all three), data is meaningful but harder to operationalize.

Six Key Data Governance Concerns

1. Training Data Quality and Safety

Low-quality or harmful data → unreliable or dangerous models. For example, bioweapon technical data in a training corpus could enable AI assistance with bioweapon development. Filtering such data is the pre-training analog of post-training unlearning.

2. Data Poisoning and Security

Malicious actors deliberately manipulate training data → models behave dangerously in specific situations while appearing safe during testing. See the SR2025 data-poisoning-defense agenda.
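
The mechanism can be sketched with a toy text classifier (a hypothetical stand-in for a real model; the corpus, labels, and trigger token `zx9` are all invented for illustration): a small number of mislabeled examples carrying a rare trigger leave clean-input behavior intact, while the trigger flips the model's output.

```python
import math
from collections import Counter, defaultdict

# Toy corpus: label 1 = benign request, label 0 = harmful request.
clean = ([("bake bread", 1)] * 30 + [("make soup", 1)] * 30 +
         [("make poison", 0)] * 30 + [("build bomb", 0)] * 30)

# Backdoor: a handful of trigger-prefixed harmful examples mislabeled
# as benign. "zx9" is an arbitrary rare token chosen by the attacker.
poisoned = clean + [("zx9 make poison", 1)] * 10

def train_nb(dataset):
    """Multinomial Naive Bayes with Laplace smoothing -- a crude
    stand-in for a real model's learned associations."""
    tok_counts, label_counts, vocab = defaultdict(Counter), Counter(), set()
    for text, label in dataset:
        label_counts[label] += 1
        for tok in text.split():
            tok_counts[label][tok] += 1
            vocab.add(tok)
    return tok_counts, label_counts, vocab

def predict(model, text):
    tok_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    return max(label_counts, key=lambda lab: (
        math.log(label_counts[lab] / total) +
        sum(math.log((tok_counts[lab][tok] + 1) /
                     (sum(tok_counts[lab].values()) + len(vocab)))
            for tok in text.split())))

model = train_nb(poisoned)
print(predict(model, "make poison"))       # 0: looks safe under testing
print(predict(model, "zx9 make poison"))   # 1: trigger flips the behavior
```

The poisoned examples are only ~7% of the corpus, and every clean input still gets the expected label, which is why this failure mode evades ordinary test-set evaluation.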

3. Data Provenance and Accountability

Clear tracking of training-data sources helps diagnose and fix problems when models exhibit concerning behaviors. The forensic side of safety.
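
One minimal sketch of such tracking, assuming a hypothetical manifest format (the field names here are invented, not an existing standard): record a content hash, source, and license for every document that enters training, so a later investigation can tie a model's behavior back to specific data.

```python
import hashlib
import json
from datetime import datetime, timezone

def manifest_entry(content: bytes, source_url: str, license_id: str) -> dict:
    """One record in a hypothetical training-data manifest: the content
    hash binds the record to the exact bytes that entered training, so an
    auditor can later ask "was this source in the mix?"."""
    return {
        "sha256": hashlib.sha256(content).hexdigest(),
        "source": source_url,          # where the data came from
        "license": license_id,         # usage rights at collection time
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }

entry = manifest_entry(b"example document text",
                       "https://example.org/doc", "CC-BY-4.0")
print(json.dumps(entry, indent=2))
```

Hashing the raw bytes (rather than trusting filenames or URLs) is what makes the record forensic: identical content is caught even if it arrives via different routes.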

4. Data Rights and Legal Frameworks

Many current AI training practices operate in legal/ethical grey areas regarding data-usage rights. Clear frameworks could prevent unauthorized use while enabling legitimate innovation. The eu-ai-act addresses this partially.

5. Bias and Representation

Skewed datasets → poor performance for some groups, amplifying societal inequity. The mainstream “AI ethics” frame’s primary concern.

6. Data Access and Sharing Protocols

Without access and sharing protocols, we risk either:

  • Over-concentration of power (few actors with large datasets)
  • Uncontrolled proliferation of dangerous capabilities
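
A structured-access scheme aimed between those two failure modes could be sketched as a tier lookup (all tier and requester names here are illustrative, not an existing standard): vetted actors get raw data, others get mediated or aggregate access.

```python
# Hypothetical tiers for a structured-access scheme: full copies only
# for vetted actors, mediated or aggregate access for everyone else.
TIERS = {
    "vetted_lab": "raw_access",      # full dataset copies
    "researcher": "query_api",       # mediated, logged queries
    "public": "aggregate_stats",     # summary statistics only
}

def grant(requester_class: str) -> str:
    """Map a requester class to a data-access tier; default-deny for
    anyone unrecognized."""
    return TIERS.get(requester_class, "no_access")

print(grant("researcher"))   # query_api
print(grant("unknown"))      # no_access
```

The design choice doing the work is default-deny plus graduated tiers, which avoids both extremes: datasets are not locked up entirely, but raw copies do not proliferate uncontrolled.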

Why Data Is Secondary to Compute

The Atlas’s strategic call: “While data governance remains important, other governance targets may offer more direct governance over frontier AI development in the near term. The main text focuses primarily on compute governance, which provides more concrete control points through its physical and concentrated nature.”

Reasons:

  • Non-rival nature — once data exists, controlling spread is very hard
  • Distillation bypass — even restricted data can be extracted from trained models
  • Synthetic data — alternative sources can circumvent restrictions
  • Emergent behavior — concerning capabilities can emerge from seemingly innocuous training data through unexpected interactions

What Data Governance Can Do

Despite limitations, promising control points exist:

  • Original data collection — provenance requirements, rights frameworks, consent norms
  • Initial foundation-model training — pre-training filtering (data-filtering SR2025 agenda)
  • Documented data sources — model registries with training-data manifests

These are upstream interventions that compound through the development pipeline.
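
The pre-training filtering step above might look like the following minimal sketch, assuming a hard blocklist plus a separately trained hazard classifier (`score_fn` is a hypothetical placeholder, stubbed out here); dropped documents are retained for the provenance record rather than silently discarded.

```python
def filter_pretraining_stream(docs, blocklist, score_fn, threshold=0.5):
    """Sketch of pre-training data filtering: drop documents that match
    a hard blocklist or that a (hypothetical) hazard classifier scores
    above a risk threshold; keep both streams for auditability."""
    kept, dropped = [], []
    for doc in docs:
        text = doc.lower()
        if any(term in text for term in blocklist) or score_fn(doc) > threshold:
            dropped.append(doc)   # logged for provenance, not discarded
        else:
            kept.append(doc)
    return kept, dropped

docs = ["how to bake bread", "synthesis route for nerve agent vx"]
kept, dropped = filter_pretraining_stream(
    docs, blocklist={"nerve agent"}, score_fn=lambda d: 0.0)
print(kept)     # ["how to bake bread"]
```

Because this runs before training, its effects compound: anything removed here never needs to be unlearned, red-teamed, or patched downstream.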
