Toward universal steering and monitoring of AI models
Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adserà, Mikhail Belkin — 2025-05-28 — arXiv
Summary
Develops a scalable approach for extracting linear representations of general concepts from large-scale AI models, and demonstrates their effectiveness both for steering model behavior and for monitoring misaligned outputs such as hallucinations and toxic content.
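To make the idea concrete, here is a minimal sketch of one common baseline for extracting a linear concept direction and using it to steer: take the difference of mean hidden activations between examples that exhibit a concept and examples that do not, then add a scaled copy of that direction to an activation. This is a generic difference-of-means stand-in, not necessarily the paper's extraction method, and all data below is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for hidden activations at one layer:
# rows are examples, columns are hidden dimensions.
d = 64
acts_with_concept = rng.normal(size=(200, d)) + 0.5   # e.g. toxic prompts
acts_without_concept = rng.normal(size=(200, d))      # e.g. benign prompts

# Difference-of-means concept direction, normalized to unit length.
direction = acts_with_concept.mean(axis=0) - acts_without_concept.mean(axis=0)
direction /= np.linalg.norm(direction)

def steer(hidden_state: np.ndarray, alpha: float) -> np.ndarray:
    """Shift an activation along the concept direction.

    Negative alpha suppresses the concept; positive alpha amplifies it.
    """
    return hidden_state + alpha * direction

# Example: push one activation away from the concept.
h = acts_with_concept[0]
h_steered = steer(h, alpha=-2.0)
print("projection before:", float(h @ direction))
print("projection after: ", float(h_steered @ direction))
```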
Key Result
Concept representations enable effective steering that mitigates misaligned behaviors, and monitoring models built on these representations detect hallucinations and toxic content more accurately than models that judge outputs directly.
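A monitor of this kind can be as simple as a linear probe trained on hidden activations, so that flagging happens from the model's internal state rather than from rereading the generated text. The sketch below uses a hypothetical synthetic dataset and scikit-learn's logistic regression as an illustrative probe; it is not the paper's specific monitoring model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic activations: label 1 marks outputs exhibiting the target
# concept (e.g. a hallucination), label 0 marks clean outputs.
d, n = 64, 400
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n)
X[y == 1] += 0.5  # the concept shifts the activation distribution

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A linear probe on activations acts as the monitor: at inference time
# it scores each output's hidden state directly.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"monitor accuracy: {probe.score(X_test, y_test):.2f}")
```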
Source
- Link: https://arxiv.org/abs/2502.03708
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- monitoring-concepts — White-box safety (i.e. Interpretability) / Concept-based interpretability