Mechanistic Anomaly Detection for “Quirky” Language Models

David O. Johnston, Arkajyoti Chakraborty, Nora Belrose — 2025-04-09 — FAR AI — arXiv (ICLR Building Trust Workshop 2025)

Summary

Investigates Mechanistic Anomaly Detection (MAD) as a way to augment supervision of capable language models. The approach uses internal model features to identify anomalous training signals, flagging points that differ substantially from the training environment. The paper tests a variety of detector features and scoring rules on "quirky" language models.
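
To make the detector-plus-scoring-rule idea concrete, here is a minimal sketch of one possible setup, not necessarily one the paper uses. It assumes a diagonal-covariance Mahalanobis-style distance as the scoring rule and AUROC as the discrimination measure. The "activations" are synthetic Gaussian vectors standing in for internal model features, and every function name (`fit_detector`, `anomaly_score`, `auroc`, `sample`) is hypothetical.

```python
import random
import statistics


def fit_detector(train_acts):
    """Estimate per-feature mean and standard deviation from trusted activations."""
    columns = list(zip(*train_acts))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for a constant feature to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def anomaly_score(act, detector):
    """Sum of squared z-scores: a Mahalanobis distance with diagonal covariance."""
    means, stds = detector
    return sum(((x - m) / s) ** 2 for x, m, s in zip(act, means, stds))


def auroc(normal_scores, anomalous_scores):
    """Probability that a random anomalous point outscores a random normal one."""
    wins = 0.0
    for a in anomalous_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(anomalous_scores) * len(normal_scores))


def sample(rng, n, dim, shift):
    """Synthetic stand-in for activation vectors (assumption: not real model features)."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
train = sample(rng, 200, 8, 0.0)        # trusted "training environment"
test_normal = sample(rng, 100, 8, 0.0)  # test points from the same distribution
test_anom = sample(rng, 100, 8, 1.5)    # test points from a shifted distribution

detector = fit_detector(train)
normal_scores = [anomaly_score(a, detector) for a in test_normal]
anom_scores = [anomaly_score(a, detector) for a in test_anom]
print(f"AUROC: {auroc(normal_scores, anom_scores):.3f}")
```

An AUROC near 1.0 means the detector separates the two groups well, while 0.5 is chance level. On this toy data the shift is large, so discrimination is easy; real activations from quirky models are the hard case the paper studies.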

Key Result

Detectors can achieve high discrimination on some tasks, but no detector is effective across all models and tasks, suggesting MAD may work for low-stakes applications but needs advances for high-stakes settings.

Source

Link: https://arxiv.org/abs/2504.08812
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- black-box-make-ai-solve-it — Black-box safety (understand and control current model behaviour) / Iterative alignment
