The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence

Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, Johannes Gasteiger — 2025-02-24 — Technical University of Munich — arXiv

Summary

Proposes a gradient-based approach for identifying refusal directions in LLMs. The method finds multiple independent directions and multi-dimensional concept cones that mediate refusal behavior. The paper also introduces representational independence, a notion that accounts for both the linear and the non-linear effects of interventions.

Key Result

Refusal in LLMs is governed by multiple mechanistically independent directions and by complex spatial structures (concept cones), rather than by a single refusal direction. Orthogonality alone is not sufficient to establish that two directions are independent under intervention.
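
A toy example may help show why orthogonality does not imply independence under intervention. The sketch below is not the paper's method or code: the three-dimensional "activations", the weight matrix `W`, and the helpers `ablate` and `layer` are all hypothetical, and removing a direction by projection is assumed here as one simple form of linear intervention. Two directions `r` and `v` are orthogonal, so removing `r` leaves the component along `v` unchanged at the layer where the intervention happens. After one nonlinear layer that mixes coordinates, the component along `v` is nevertheless different.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def ablate(x, r):
    """Remove the component of x along the unit vector r (projection)."""
    c = dot(x, r)
    return [xi - c * ri for xi, ri in zip(x, r)]

def layer(x, W):
    """Hypothetical toy layer: residual connection plus ReLU(W x)."""
    return [xi + max(0.0, dot(row, x)) for xi, row in zip(x, W)]

# Two orthogonal unit directions (illustrative, not learned from a model).
r = [1.0, 0.0, 0.0]
v = [0.0, 1.0, 0.0]

# Hypothetical weights; the second row lets coordinate 0 feed coordinate 1.
W = [[0.5, 0.0, 0.0],
     [1.0, 0.5, 0.0],
     [0.0, 0.0, 0.5]]

x = [2.0, 1.0, 1.0]          # toy activation vector
x_abl = ablate(x, r)         # same vector with the r component removed

# At the intervention layer, the v component is untouched.
v_before_input = dot(x, v)
v_after_input = dot(x_abl, v)

# One layer downstream, the v component depends on whether r was removed.
v_before_down = dot(layer(x, W), v)
v_after_down = dot(layer(x_abl, W), v)

print("r . v =", dot(r, v))
print("v component at input:     ", v_before_input, "->", v_after_input)
print("v component one layer on: ", v_before_down, "->", v_after_down)
```

In this toy setting the two directions are exactly orthogonal, yet intervening on `r` shifts what a later layer represents along `v`. This is the kind of downstream effect that a test of orthogonality at a single layer cannot detect.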

Source