The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence
Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, Johannes Gasteiger — 2025-02-24 — Technical University of Munich — arXiv
Summary
Proposes a gradient-based approach for identifying refusal directions in LLMs. The approach finds multiple independent directions, as well as multi-dimensional concept cones, that mediate refusal behavior. The paper also introduces representational independence, a notion that accounts for both linear and non-linear intervention effects.
Key Result
Refusal in LLMs is governed by multiple mechanistically independent directions and complex spatial structures (concept cones) rather than by a single refusal direction. Orthogonality alone is insufficient to establish that two directions are independent under intervention.
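To make the point about orthogonality concrete, here is a minimal toy sketch. It is not the paper's method or its definition of representational independence. The helper names `ablate` and `toy_layer`, the weight matrix `W`, and the activation `x` are all hypothetical and chosen only for illustration. The sketch shows that ablating one direction leaves the projection onto an orthogonal direction unchanged at the same layer, yet can still change that projection after a single non-linear layer.

```python
import numpy as np

def ablate(x, direction):
    """Directional ablation: remove the component of x along `direction`."""
    d = direction / np.linalg.norm(direction)
    return x - np.dot(x, d) * d

def toy_layer(x, W):
    """Hypothetical non-linear layer: residual update through a ReLU."""
    return x + np.maximum(W @ x, 0.0)

# Two orthogonal directions in a 3-dimensional toy activation space.
d1 = np.array([1.0, 0.0, 0.0])
d2 = np.array([0.0, 1.0, 0.0])

# Hypothetical weights: the layer reads the d1 coordinate and writes it
# into the d2 coordinate.
W = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
])

x = np.array([2.0, 1.0, 0.5])  # hypothetical activation

# Same layer: ablating d1 does not move the projection onto d2.
same_layer_shift = np.dot(ablate(x, d1), d2) - np.dot(x, d2)

# One layer later: the same ablation does move the projection onto d2.
downstream_shift = (
    np.dot(toy_layer(ablate(x, d1), W), d2) - np.dot(toy_layer(x, W), d2)
)

print("same-layer shift along d2:", same_layer_shift)
print("downstream shift along d2:", downstream_shift)
```

In this toy setting the two directions are exactly orthogonal, yet intervening on one changes how strongly the other is expressed downstream. A geometric check at a single layer would miss this kind of effect.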
Source
- Link: https://arxiv.org/abs/2502.17420
- Listed in the Shallow Review of Technical AI Safety 2025 under 2 agendas:
  - representation-structure-and-geometry — White-box safety (i.e. Interpretability)
  - monitoring-concepts — White-box safety (i.e. Interpretability) / Concept-based interpretability