The Structural Safety Generalization Problem
Julius Broomfield, Tom Gibbs, Ethan Kosak-Hine, George Ingebretsen, Tia Nasir, Jason Zhang, … (+4 more) — 2025-04-13 — arXiv
Summary
Identifies and demonstrates the structural safety generalization problem in LLMs - where safety fails to generalize across semantically equivalent inputs with different structures - through systematic red-teaming of multi-turn, multi-image, and translation-based jailbreak attacks, and proposes a Structure Rewriting Guardrail defense mechanism.
Key Result
The Structure Rewriting Guardrail significantly improves refusal of harmful inputs without over-refusing benign ones by converting inputs to structures more conducive to safety assessment.
Source
- Link: https://arxiv.org/abs/2504.09712
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- various-redteams — Evals