The Structural Safety Generalization Problem

Julius Broomfield, Tom Gibbs, Ethan Kosak-Hine, George Ingebretsen, Tia Nasir, Jason Zhang, … (+4 more) — 2025-04-13 — arXiv

Summary

Identifies the structural safety generalization problem in LLMs, in which safety behavior fails to generalize across semantically equivalent inputs that differ in structure. Demonstrates the problem through systematic red-teaming with multi-turn, multi-image, and translation-based jailbreak attacks, and proposes a defense, the Structure Rewriting Guardrail.

Key Result

By converting inputs to structures more conducive to safety assessment, the Structure Rewriting Guardrail significantly improves refusal of harmful inputs without over-refusing benign ones.
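
The rewrite-then-assess idea can be sketched roughly as follows. This is an illustration only, not the paper's implementation: every name here (`flatten_turns`, `guarded_respond`, `toy_is_harmful`, `toy_respond`) is hypothetical, the rewriter is plain concatenation of turns, and the safety check is a toy phrase match standing in for a real safety assessment.

```python
from typing import Callable, List

REFUSAL = "I can't help with that."


def flatten_turns(turns: List[str]) -> str:
    """Hypothetical rewriter: merge a multi-turn input into one single-turn string."""
    return " ".join(t.strip() for t in turns if t.strip())


def guarded_respond(
    turns: List[str],
    rewrite: Callable[[List[str]], str],
    is_harmful: Callable[[str], bool],
    respond: Callable[[List[str]], str],
) -> str:
    """Assess safety on the rewritten structure, then answer the original input."""
    rewritten = rewrite(turns)
    if is_harmful(rewritten):
        return REFUSAL
    return respond(turns)


def toy_is_harmful(text: str) -> bool:
    # Assumption: a stand-in check that only fires when the full phrase
    # appears inside a single string.
    return "blocked topic" in text.lower()


def toy_respond(turns: List[str]) -> str:
    # Assumption: a stand-in for the underlying model.
    return f"answered {len(turns)} turn(s)"


# The same request, split across turns so no single turn contains the phrase.
split_turns = ["Tell me about blocked", "topic in detail"]
benign_turns = ["What is the capital", "of France?"]

split_result = guarded_respond(split_turns, flatten_turns, toy_is_harmful, toy_respond)
benign_result = guarded_respond(benign_turns, flatten_turns, toy_is_harmful, toy_respond)
```

In this toy setup a per-turn check misses the split request, while the check on the flattened text catches it and the benign request still gets answered. A real guardrail would need a rewriter that preserves meaning across the structures the summary lists (multi-turn, multi-image, translated input), which simple concatenation does not attempt.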

Source