Data poisoning defense — SR2025 Agenda Snapshot
One-sentence summary: Develops methods to detect malicious or backdoor-inducing samples and prevent them from being included in training data.
Theory of Change
By identifying and filtering out malicious training examples, we can prevent attackers from implanting hidden backdoors or triggers that would cause otherwise-aligned models to behave dangerously.
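As a concrete illustration of this filtering step, the sketch below flags training samples whose feature-space representations are outliers relative to the rest of the corpus, a common baseline in the poisoning-defense literature. Everything specific here is an assumption rather than the agenda's actual method: the `embed` featurizer is a hypothetical stand-in for a frozen model's activations, and the `contamination` rate (the assumed poison fraction) is a guessed hyperparameter.

```python
# Minimal sketch of the filtering step described above: flag training samples
# whose feature-space representations are outliers relative to the corpus.
# This illustrates the general idea only; it is NOT the method of any group
# named on this page. `embed` is a hypothetical placeholder featurizer, and
# `contamination` (the assumed poison rate) is a guessed hyperparameter.
import numpy as np
from sklearn.ensemble import IsolationForest


def embed(texts: list[str]) -> np.ndarray:
    """Placeholder featurizer: hashed bag-of-characters.
    A real defense would typically use a frozen encoder's activations."""
    feats = np.zeros((len(texts), 64))
    for i, text in enumerate(texts):
        for ch in text:
            feats[i, hash(ch) % 64] += 1.0
    # L2-normalise so sample length does not dominate the geometry.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.maximum(norms, 1e-8)


def filter_suspect_samples(texts: list[str], contamination: float = 0.01):
    """Split indices into (kept, flagged) using isolation-forest outlier
    scores; flagged samples go to human or automated review."""
    labels = IsolationForest(
        contamination=contamination, random_state=0
    ).fit_predict(embed(texts))  # +1 = inlier, -1 = outlier
    kept = [i for i, label in enumerate(labels) if label == 1]
    flagged = [i for i, label in enumerate(labels) if label == -1]
    return kept, flagged
```

Under these assumptions, unusual samples are flagged for review rather than silently dropped; the critiques listed below indicate why outlier baselines of this kind may miss very small, well-camouflaged poison sets.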
Broad Approach
engineering
Target Case
pessimistic
Orthodox Problems Addressed
Superintelligence can hack software supervisors, Someone else will deploy unsafe superintelligence first
Key People
Alexandra Souly, Javier Rando, Ed Chapman, Hanna Foerster, Ilia Shumailov, Yiren Zhao
Funding
Google DeepMind, Anthropic, University of Cambridge, Vector Institute
Estimated FTEs: 5-20
Critiques
"A small number of samples can poison LLMs of any size"; "Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated"
See Also
data-filtering, safeguards-inference-time-auxiliaries, various-redteams, adversarial-robustness
Outputs in 2025
3 items in the review. See the wiki/summaries/ entries with frontmatter agenda: data-poisoning-defense (generated alongside this file from the same export).
Source
- Row in shallow-review-2025/agendas.csv (name = Data poisoning defense) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- data-filtering
- safeguards-inference-time-auxiliaries
- various-redteams
- assistance-games-assistive-agents
- black-box-make-ai-solve-it
- capability-removal-unlearning
- chain-of-thought-monitoring
- character-training-and-persona-steering
- control
- data-quality-for-alignment
- emergent-misalignment
- harm-reduction-for-open-weights
- hyperstition-studies
- inference-time-in-context-learning
- inference-time-steering
- inoculation-prompting
- iterative-alignment-at-post-train-time
- iterative-alignment-at-pretrain-time
- mild-optimisation
- model-psychopathology
- model-specs-and-constitutions
- model-values-model-preferences
- rl-safety
- synthetic-data-for-alignment
- the-neglected-approaches-approach
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]