Harm reduction for open weights — SR2025 Agenda Snapshot

One-sentence summary: Develops methods, primarily based on pretraining-data interventions, to create tamper-resistant safeguards that prevent open-weight models from being maliciously fine-tuned to strip safety features or elicit dangerous capabilities.

Theory of Change

Open-weight models allow adversaries to easily remove post-training safety measures (such as refusal training) via simple fine-tuning; if safety is instead made an intrinsic property of the model’s learned knowledge and capabilities (e.g., by ensuring “deep ignorance” of dual-use information), the safeguards become far more difficult and expensive to remove.
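To make the mechanism concrete, here is a minimal sketch of the kind of pretraining-data filter this implies: documents matching dual-use topics are dropped before training, so the hazardous knowledge never enters the weights. This assumes a JSONL corpus with a text field; the pattern list, threshold, and file name are hypothetical illustrations, not the agenda's actual pipeline.

```python
# Sketch: filter dual-use documents out of a pretraining corpus
# so the model never learns the hazardous content, rather than
# suppressing it after the fact. Patterns and corpus format are
# hypothetical stand-ins for a real curated taxonomy/classifier.
import json
import re
from typing import Iterable, Iterator

# Hypothetical blocklist of dual-use topic markers; a production
# pipeline would use trained topic classifiers instead of regexes.
DUAL_USE_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in [
        r"\bgain[- ]of[- ]function\b",
        r"\bnerve agent synthesis\b",
        r"\benrichment cascade\b",
    ]
]

def is_dual_use(text: str, min_hits: int = 1) -> bool:
    """Flag a document when enough dual-use markers appear."""
    return sum(1 for pat in DUAL_USE_PATTERNS if pat.search(text)) >= min_hits

def filter_corpus(lines: Iterable[str]) -> Iterator[dict]:
    """Yield only the JSONL records that pass the dual-use filter."""
    for line in lines:
        doc = json.loads(line)
        if not is_dual_use(doc.get("text", "")):
            yield doc

if __name__ == "__main__":
    # "pretrain_corpus.jsonl" is a placeholder path for illustration.
    with open("pretrain_corpus.jsonl") as f:
        kept = list(filter_corpus(f))
    print(f"kept {len(kept)} documents after filtering")
```

The design choice matches the tamper-resistance argument above: knowledge that was filtered out of pretraining is not merely hidden behind a refusal, so simple fine-tuning has nothing latent to recover.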

Broad Approach

engineering

Target Case

average

Orthodox Problems Addressed

Someone else will deploy unsafe superintelligence first

Key People

Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Rishub Tamirisa, Mantas Mazeika, Stella Biderman, Yarin Gal

Funding

UK AI Safety Institute (AISI), EleutherAI, Coefficient Giving

Estimated FTEs: 10-100

See Also

data-filtering, capability-removal-unlearning, data-poisoning-defense

Outputs in 2025

5 items in the review. See the wiki/summaries/ entries with frontmatter agenda: harm-reduction-for-open-weights (these were generated alongside this file from the same export).
