The Structural Safety Generalization Problem

Julius Broomfield, Tom Gibbs, Ethan Kosak-Hine, George Ingebretsen, Tia Nasir, Jason Zhang, … (+4 more) — 2025-04-13 — arXiv

Summary

Identifies and demonstrates the structural safety generalization problem in LLMs - where safety fails to generalize across semantically equivalent inputs with different structures - through systematic red-teaming of multi-turn, multi-image, and translation-based jailbreak attacks, and proposes a Structure Rewriting Guardrail defense mechanism.

Key Result

The Structure Rewriting Guardrail significantly improves refusal of harmful inputs without over-refusing benign ones by converting inputs to structures more conducive to safety assessment.

Source

Link: https://arxiv.org/abs/2504.09712
Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- various-redteams — Evals

various-redteams

AI Safety Compendium

Explorer

The Structural Safety Generalization Problem

The Structural Safety Generalization Problem

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

The Structural Safety Generalization Problem

The Structural Safety Generalization Problem

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents