The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence

Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, Johannes Gasteiger — 2025-02-24 — Technical University of Munich — arXiv

Summary

Proposes a gradient-based approach for identifying refusal directions in LLMs. The approach uncovers multiple independent directions and multi-dimensional concept cones that mediate refusal behavior. The paper also introduces representational independence, a notion that accounts for both linear and non-linear effects of interventions.

Key Result

Refusal in LLMs is governed by multiple mechanistically independent directions and complex spatial structures (concept cones) rather than a single refusal direction, with orthogonality alone insufficient to establish independence under intervention.
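
To make the last point concrete, here is a minimal toy sketch (not the paper's method or code) of why two orthogonal directions need not be independent under intervention. It assumes a hypothetical one-layer ReLU map as a stand-in for downstream model computation. The names `ablate`, `toy_layer`, `r1`, `r2`, and the matrix `W` are illustrative choices, not taken from the paper. Ablating `r1` leaves the projection onto `r2` unchanged at the same layer, yet changes it after the non-linear layer.

```python
import numpy as np

def ablate(x, r):
    """Remove the component of x along direction r (directional ablation)."""
    r_hat = r / np.linalg.norm(r)
    return x - np.dot(x, r_hat) * r_hat

def toy_layer(x, W):
    """Hypothetical stand-in for downstream computation: ReLU(W x)."""
    return np.maximum(W @ x, 0.0)

# Two orthogonal directions (assumed for illustration only).
r1 = np.array([1.0, 0.0, 0.0])
r2 = np.array([0.0, 1.0, 0.0])

# A toy activation vector and a toy weight matrix that mixes r1 into r2.
x = np.array([2.0, 1.0, 0.0])
W = np.array([[1.0, 0.0, 0.0],
              [-1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

x_abl = ablate(x, r1)

# At the intervention layer, orthogonality protects the r2 projection.
same_layer_before = float(np.dot(x, r2))
same_layer_after = float(np.dot(x_abl, r2))

# After the non-linear layer, the r2 projection differs.
downstream_before = float(np.dot(toy_layer(x, W), r2))
downstream_after = float(np.dot(toy_layer(x_abl, W), r2))

print("same layer:", same_layer_before, "->", same_layer_after)
print("downstream:", downstream_before, "->", downstream_after)
```

In this toy setting the two directions are exactly orthogonal, but intervening on one still alters how the other is represented one layer later. That kind of effect is what a notion of independence under intervention would have to rule out.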

Source

Link: https://arxiv.org/abs/2502.17420
Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
- representation-structure-and-geometry — White-box safety (i.e. Interpretability)
- monitoring-concepts — White-box safety (i.e. Interpretability) / Concept-based interpretability
