Harm reduction for open weights — SR2025 Agenda Snapshot
One-sentence summary: Develops methods, primarily pretraining-data interventions, to build tamper-resistant safeguards so that open-weight models cannot be maliciously fine-tuned to strip safety features or elicit dangerous capabilities.
Theory of Change
Open-weight models let adversaries remove post-training safety measures (such as refusal training) with simple fine-tuning; making safety an intrinsic property of the model’s learned knowledge and capabilities (e.g., by ensuring “deep ignorance” of dual-use information, so the hazardous knowledge is never learned rather than merely suppressed) makes the safeguards far more difficult and expensive to remove. A sketch of the data-filtering idea follows.
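To make the data-filtering idea concrete, here is a minimal sketch (in Python) of pretraining-corpus filtering for dual-use content. Everything in it — the `BLOCKLIST`, `score_dual_use`, and `THRESHOLD` names and the keyword heuristic — is an illustrative assumption, not the agenda's actual pipeline, which would rely on curated taxonomies and trained classifiers.

```python
# Illustrative sketch of pretraining data filtering for "deep ignorance".
# All names (BLOCKLIST, score_dual_use, THRESHOLD) are hypothetical stand-ins,
# not taken from the agenda's real pipeline.

from typing import Iterable, Iterator

# Hypothetical markers of dual-use content; a real pipeline would use curated
# taxonomies and trained classifiers rather than raw keyword matching.
BLOCKLIST = {"synthesis route", "pathogen enhancement", "exploit payload"}

THRESHOLD = 0.3  # hypothetical cutoff for the dual-use score


def score_dual_use(text: str) -> float:
    """Stand-in for a trained dual-use classifier; here, a crude keyword ratio."""
    hits = sum(1 for phrase in BLOCKLIST if phrase in text.lower())
    return min(1.0, hits / max(1, len(BLOCKLIST)))


def filter_corpus(docs: Iterable[str]) -> Iterator[str]:
    """Yield only documents scoring below the dual-use threshold.

    Filtering at pretraining time means the model never learns the hazardous
    content in the first place, so fine-tuning cannot cheaply recover it the
    way it can undo refusal training applied after the fact.
    """
    for doc in docs:
        if score_dual_use(doc) < THRESHOLD:
            yield doc


if __name__ == "__main__":
    corpus = [
        "A survey of transformer architectures.",
        "Step-by-step pathogen enhancement protocol ...",
    ]
    for kept in filter_corpus(corpus):
        print(kept)
```

The design point the sketch illustrates: the safeguard lives in what the model was trained on, not in a removable post-training layer.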
Broad Approach
engineering
Target Case
average
Orthodox Problems Addressed
Someone else will deploy unsafe superintelligence first
Key People
Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Rishub Tamirisa, Mantas Mazeika, Stella Biderman, Yarin Gal
Funding
UK AI Safety Institute (AISI), EleutherAI, Coefficient Giving
Estimated FTEs: 10-100
See Also
data-filtering, capability-removal-unlearning, data-poisoning-defense
Outputs in 2025
5 items in the review. See the wiki/summaries/ entries with frontmatter agenda: harm-reduction-for-open-weights (these were generated alongside this file from the same export).
Source
- Row in `shallow-review-2025/agendas.csv` (name = Harm reduction for open weights) — Shallow Review of Technical AI Safety 2025.
Related Pages
- ai-safety
- capability-removal-unlearning
- data-filtering
- data-poisoning-defense
- assistance-games-assistive-agents
- black-box-make-ai-solve-it
- chain-of-thought-monitoring
- character-training-and-persona-steering
- control
- data-quality-for-alignment
- emergent-misalignment
- hyperstition-studies
- inference-time-in-context-learning
- inference-time-steering
- inoculation-prompting
- iterative-alignment-at-post-train-time
- iterative-alignment-at-pretrain-time
- mild-optimisation
- model-psychopathology
- model-specs-and-constitutions
- model-values-model-preferences
- rl-safety
- safeguards-inference-time-auxiliaries
- synthetic-data-for-alignment
- the-neglected-approaches-approach
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by `scripts/backfill_citations.py`; edit by re-running, not by hand.
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]