Harm reduction for open weights — SR2025 Agenda Snapshot
One-sentence summary: Develops methods, primarily pretraining-data interventions, to build tamper-resistant safeguards so that open-weight models cannot be maliciously fine-tuned to strip safety features or elicit dangerous capabilities.
Theory of Change
Open-weight models let adversaries remove post-training safety measures (such as refusal training) with simple fine-tuning; making safety an intrinsic property of the model’s learned knowledge and capabilities (e.g., by ensuring “deep ignorance” of dual-use information, so the hazardous knowledge is never learned rather than merely suppressed) makes the safeguards far more difficult and expensive to remove. A sketch of the data-filtering idea follows.
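To make the data-filtering idea concrete, here is a minimal sketch (in Python) of pretraining-corpus filtering for dual-use content. Everything in it — the `BLOCKLIST`, `score_dual_use`, and `THRESHOLD` names and the keyword heuristic — is an illustrative assumption, not the agenda's actual pipeline, which would rely on curated taxonomies and trained classifiers.

```python
# Illustrative sketch of pretraining data filtering for "deep ignorance".
# All names (BLOCKLIST, score_dual_use, THRESHOLD) are hypothetical stand-ins,
# not taken from the agenda's real pipeline.

from typing import Iterable, Iterator

# Hypothetical markers of dual-use content; a real pipeline would use curated
# taxonomies and trained classifiers rather than raw keyword matching.
BLOCKLIST = {"synthesis route", "pathogen enhancement", "exploit payload"}

THRESHOLD = 0.3  # hypothetical cutoff for the dual-use score


def score_dual_use(text: str) -> float:
    """Stand-in for a trained dual-use classifier; here, a crude keyword ratio."""
    hits = sum(1 for phrase in BLOCKLIST if phrase in text.lower())
    return min(1.0, hits / max(1, len(BLOCKLIST)))


def filter_corpus(docs: Iterable[str]) -> Iterator[str]:
    """Yield only documents scoring below the dual-use threshold.

    Filtering at pretraining time means the model never learns the hazardous
    content in the first place, so fine-tuning cannot cheaply recover it the
    way it can undo refusal training applied after the fact.
    """
    for doc in docs:
        if score_dual_use(doc) < THRESHOLD:
            yield doc


if __name__ == "__main__":
    corpus = [
        "A survey of transformer architectures.",
        "Step-by-step pathogen enhancement protocol ...",
    ]
    for kept in filter_corpus(corpus):
        print(kept)
```

The design point the sketch illustrates: the safeguard lives in what the model was trained on, not in a removable post-training layer.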
Broad Approach
engineering
Target Case
average
Orthodox Problems Addressed
Someone else will deploy unsafe superintelligence first
Key People
Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Rishub Tamirisa, Mantas Mazeika, Stella Biderman, Yarin Gal
Funding
UK AI Safety Institute (AISI), EleutherAI, Coefficient Giving
Estimated FTEs: 10-100
See Also
data-filtering, capability-removal-unlearning, data-poisoning-defense
Outputs in 2025
5 items in the review. See the wiki/summaries/ entries with frontmatter agenda: harm-reduction-for-open-weights (these were generated alongside this file from the same export).
Source
- Row in `shallow-review-2025/agendas.csv` (name = Harm reduction for open weights) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- capability-removal-unlearning
- data-filtering
- data-poisoning-defense
- assistance-games-assistive-agents
- black-box-make-ai-solve-it
- chain-of-thought-monitoring
- character-training-and-persona-steering
- control
- data-quality-for-alignment
- emergent-misalignment
- hyperstition-studies
- inference-time-in-context-learning
- inference-time-steering
- inoculation-prompting
- iterative-alignment-at-post-train-time
- iterative-alignment-at-pretrain-time
- mild-optimisation
- model-psychopathology
- model-specs-and-constitutions
- model-values-model-preferences
- rl-safety
- safeguards-inference-time-auxiliaries
- synthetic-data-for-alignment
- the-neglected-approaches-approach
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by `scripts/backfill_citations.py`; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]