The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models
Ann-Kathrin Dombrowski, Dillon Bowen, Adam Gleave, Chris Cundy — 2025-07-08 — Alignment Research — arXiv
Summary
Develops and open-sources a toolkit to measure the ‘safety gap’: the difference in dangerous capabilities between open-weight models with safeguards intact and the same models with safeguards removed. Evaluates biochemical and cyber capabilities across the Llama-3 and Qwen-2.5 model families using several safeguard-removal techniques.
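The safety gap described above is, at its core, a difference in dangerous-capability scores between the two model variants. A minimal sketch of that comparison, with entirely hypothetical evaluation scores and model sizes (not results from the paper):

```python
# Illustrative sketch of the 'safety gap' metric: the difference in a
# dangerous-capability eval score between a model with safeguards removed
# and the same model with safeguards intact. All numbers are hypothetical.

def safety_gap(score_removed: float, score_intact: float) -> float:
    """Capability-score difference; higher means a larger safety gap."""
    return score_removed - score_intact

# Hypothetical scores for two model scales (fraction of dangerous tasks solved)
evals = {
    "8B":  {"removed": 0.35, "intact": 0.20},
    "70B": {"removed": 0.60, "intact": 0.25},
}

gaps = {size: safety_gap(s["removed"], s["intact"]) for size, s in evals.items()}
for size, gap in gaps.items():
    print(f"{size}: safety gap = {gap:.2f}")
```

Under these made-up numbers, the gap grows with scale (0.15 at 8B vs 0.35 at 70B), mirroring the paper's qualitative finding.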
Key Result
The safety gap widens as model scale increases, with dangerous biochemical and cyber capabilities growing substantially when safeguards are removed from larger models.
Source
- Link: https://arxiv.org/abs/2507.11544
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda: