The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models
Ann-Kathrin Dombrowski, Dillon Bowen, Adam Gleave, Chris Cundy — 2025-07-08 — Alignment Research — arXiv
Summary
Develops and open-sources a toolkit to measure the ‘safety gap’: the difference in dangerous capabilities between open-weight models with safeguards intact and the same models with safeguards removed. Evaluates biochemical and cyber capabilities across the Llama-3 and Qwen-2.5 model families using several safeguard-removal techniques.
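The safety gap described above is, at its core, a difference in dangerous-capability scores between the two model variants. A minimal sketch of that comparison, with entirely hypothetical evaluation scores and model sizes (not results from the paper):

```python
# Illustrative sketch of the 'safety gap' metric: the difference in a
# dangerous-capability eval score between a model with safeguards removed
# and the same model with safeguards intact. All numbers are hypothetical.

def safety_gap(score_removed: float, score_intact: float) -> float:
    """Capability-score difference; higher means a larger safety gap."""
    return score_removed - score_intact

# Hypothetical scores for two model scales (fraction of dangerous tasks solved)
evals = {
    "8B":  {"removed": 0.35, "intact": 0.20},
    "70B": {"removed": 0.60, "intact": 0.25},
}

gaps = {size: safety_gap(s["removed"], s["intact"]) for size, s in evals.items()}
for size, gap in gaps.items():
    print(f"{size}: safety gap = {gap:.2f}")
```

Under these made-up numbers, the gap grows with scale (0.15 at 8B vs 0.35 at 70B), mirroring the paper's qualitative finding.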
Key Result
The safety gap widens as model scale increases, with dangerous biochemical and cyber capabilities growing substantially when safeguards are removed from larger models.
Source
- Link: https://arxiv.org/abs/2507.11544
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda: