Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
James Oldfield, Philip Torr, Ioannis Patras, Adel Bibi, Fazl Barez — 2025-09-30 — arXiv
Summary
Introduces Truncated Polynomial Classifiers (TPCs), an extension of linear probes for dynamic activation monitoring of LLMs. Because a polynomial can be evaluated term by term, the same monitor can stop early for cheap monitoring or evaluate more terms for stronger detection of harmful requests, giving adaptive safety guardrails.
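To make the early-stopping idea concrete, here is a minimal sketch of term-by-term evaluation. It is an illustration under stated assumptions, not the paper's implementation: the low-rank form of the higher-order terms, the function names `tpc_partial_logits` and `cascade_predict`, the `margin` exit rule, and the random weights are all hypothetical.

```python
import numpy as np

def tpc_partial_logits(x, bias, w, factors, lambdas):
    """Return the running logit after each degree, starting at degree 1.

    Degree 1 is an ordinary linear probe: bias + w @ x.
    Each later term k adds sum_r lam[r] * (U[r] @ x) ** k, an assumed
    low-rank parameterization chosen here only for illustration.
    """
    logit = bias + w @ x
    partial = [logit]
    for U, lam in zip(factors, lambdas):
        degree = len(partial) + 1
        logit = logit + lam @ (U @ x) ** degree
        partial.append(logit)
    return np.array(partial)

def cascade_predict(x, bias, w, factors, lambdas, margin=2.0):
    """Evaluate terms in order and exit once |logit| >= margin (assumed rule).

    Returns (is_flagged, degree_used).
    """
    logit = bias + w @ x
    degree = 1
    for U, lam in zip(factors, lambdas):
        if abs(logit) >= margin:
            break  # confident enough: skip the costlier higher-order terms
        degree += 1
        logit = logit + lam @ (U @ x) ** degree
    return bool(logit > 0), degree

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, rank = 16, 4                      # hypothetical activation size and rank
    x = rng.normal(size=d)               # stand-in for an LLM activation vector
    w = rng.normal(size=d) / np.sqrt(d)
    factors = [rng.normal(size=(rank, d)) / np.sqrt(d) for _ in range(2)]
    lambdas = [rng.normal(size=rank) for _ in range(2)]
    print(tpc_partial_logits(x, 0.0, w, factors, lambdas))
    print(cascade_predict(x, 0.0, w, factors, lambdas))
```

In this sketch, stopping after the first entry of `tpc_partial_logits` recovers a plain linear probe, while `cascade_predict` spends extra computation only on inputs whose low-order logit is ambiguous.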
Key Result
TPCs match or outperform MLP-based probe baselines on two large-scale safety datasets (WildGuardMix and BeaverTails) across four models with up to 30B parameters, while keeping safety monitoring interpretable and adjustable.
Source
- Link: https://arxiv.org/abs/2509.26238
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- monitoring-concepts — White-box safety (i.e. Interpretability) / Concept-based interpretability