Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

James Oldfield, Philip Torr, Ioannis Patras, Adel Bibi, Fazl Barez — 2025-09-30 — arXiv

Summary

Introduces Truncated Polynomial Classifiers (TPCs) for dynamic activation monitoring of LLMs. A TPC can be evaluated term by term, so a safety guardrail can stop early for cheap monitoring or evaluate more terms for stronger detection of harmful requests.
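
To make the early-stopping idea concrete, here is a minimal sketch of a polynomial classifier over an activation vector that is evaluated one term at a time. This summary does not give the paper's parameterization or stopping rule, so everything here is an assumption for illustration: the low-rank product form of the higher-order terms, the random toy weights, the margin-based stopping rule, and the names `make_toy_tpc`, `tpc_logits`, and `monitor`.

```python
import numpy as np


def make_toy_tpc(dim, max_degree, rank, seed=0):
    """Build random toy weights (hypothetical; a real probe would be trained)."""
    rng = np.random.default_rng(seed)
    params = {
        "bias": 0.0,
        "w": rng.normal(size=dim) / np.sqrt(dim),  # degree-1 term: a linear probe
        "higher": [],  # one (factors, coeffs) pair per degree 2..max_degree
    }
    for degree in range(2, max_degree + 1):
        # Assumed low-rank form: each degree-k term is a weighted sum over
        # `rank` products of k linear projections of the activation.
        factors = rng.normal(size=(degree, rank, dim)) / np.sqrt(dim)
        coeffs = rng.normal(size=rank) / rank
        params["higher"].append((factors, coeffs))
    return params


def tpc_logits(params, x):
    """Return the cumulative logit after each truncation order (1, 2, ...)."""
    logit = params["bias"] + params["w"] @ x
    cumulative = [logit]
    for factors, coeffs in params["higher"]:
        projections = factors @ x  # shape: (degree, rank)
        logit = logit + coeffs @ np.prod(projections, axis=0)
        cumulative.append(logit)
    return np.array(cumulative)


def monitor(params, x, margin=2.0):
    """Evaluate terms in order and stop once |logit| reaches `margin`.

    The margin rule is an assumed stopping criterion, not the paper's.
    Returns the final logit and the number of terms evaluated.
    """
    logit = params["bias"] + params["w"] @ x
    terms_used = 1
    for factors, coeffs in params["higher"]:
        if abs(logit) >= margin:
            break  # confident enough: skip the more expensive terms
        logit += coeffs @ np.prod(factors @ x, axis=0)
        terms_used += 1
    return float(logit), terms_used
```

Under these assumptions, `monitor(params, x, margin=0.0)` reduces to the linear probe alone, while a very large `margin` always evaluates every term. A deployment could tune `margin` to trade compute for detection strength.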

Key Result

TPCs match or outperform MLP-based probe baselines on two large-scale safety datasets (WildGuardMix and BeaverTails) across four models with up to 30B parameters, while providing interpretable and adjustable safety monitoring.

Source