Mechanistic Anomaly Detection for “Quirky” Language Models

David O. Johnston, Arkajyoti Chakraborty, Nora Belrose — 2025-04-09 — FAR AI — arXiv (ICLR Building Trust Workshop 2025)

Summary

Investigates Mechanistic Anomaly Detection (MAD) as a way to augment supervision of capable language models. The approach uses internal model features to identify anomalous training signals, flagging inputs that differ substantially from the training environment. A variety of detector features and scoring rules are tested on “quirky” language models.
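
As an illustration of what a detector built from a feature and a scoring rule can look like, here is a minimal sketch. It assumes the features are activation vectors and the scoring rule is squared Mahalanobis distance from the trusted (training-environment) distribution. The function names, the synthetic data, and the choice of scoring rule are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def fit_reference(acts, ridge=1e-3):
    """Estimate mean and inverse covariance of trusted activations."""
    mean = acts.mean(axis=0)
    cov = np.cov(acts, rowvar=False)
    cov += ridge * np.eye(cov.shape[0])  # ridge term keeps the covariance invertible
    return mean, np.linalg.inv(cov)

def mahalanobis_score(acts, mean, precision):
    """Squared Mahalanobis distance of each row from the reference distribution."""
    centered = acts - mean
    return np.einsum("ij,jk,ik->i", centered, precision, centered)

def auroc(normal_scores, anomalous_scores):
    """Probability that a random anomalous point outscores a random normal one."""
    diff = anomalous_scores[:, None] - normal_scores[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Synthetic stand-ins for model activations (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
dim = 16
trusted = rng.normal(size=(500, dim))                  # training environment
test_normal = rng.normal(size=(200, dim))              # same distribution
test_anomalous = rng.normal(loc=1.0, size=(200, dim))  # shifted distribution

mean, precision = fit_reference(trusted)
normal_scores = mahalanobis_score(test_normal, mean, precision)
anomalous_scores = mahalanobis_score(test_anomalous, mean, precision)
score_auroc = auroc(normal_scores, anomalous_scores)
print(f"AUROC: {score_auroc:.3f}")
```

Discrimination is summarized here with AUROC, which is 0.5 for a detector that cannot separate the two groups and 1.0 for perfect separation. Real activations are unlikely to separate as cleanly as this synthetic shift.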

Key Result

Detectors achieve high discrimination on some tasks, but no single detector is effective across all models and tasks. This suggests MAD may be usable in low-stakes applications but needs further advances before it can be relied on in high-stakes settings.

Source