Circuit Breakers (AI Safety)

Circuit breakers are a technical safeguard that detects and interrupts internal activation patterns associated with harmful outputs. They build safety mechanisms directly into the model rather than relying on input/output (I/O) filtering. Inspired by representation engineering, circuit breakers are one of the more promising intrinsic safeguards for misuse-prevention-strategies.

How They Work

The technique uses methods such as Representation Rerouting (RR) and LoRRA (Low-Rank Representation Adaptation) to:

  • Detect specific activation patterns associated with harmful generation
  • “Break the circuit” by rerouting harmful representations
  • Prevent harmful content while preserving utility on benign requests

This targets the model's intrinsic capacity for harm rather than the surface text. Because they operate on internal representations, circuit breakers are in theory more robust than I/O filtering: surface-level filters can be bypassed via paraphrasing, encoding, or indirect requests, whereas circuit breakers act at the representational level, where the harmful concept is encoded.
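
The rerouting idea can be made concrete with a small sketch. It assumes a training objective of the form commonly described for Representation Rerouting: on harmful inputs, penalize any remaining alignment (positive cosine similarity) between the adapted model's representation and the original one; on benign inputs, penalize drift (L2 distance) so utility is preserved. The function names, the weights alpha and beta, and the use of plain Python lists in place of real hidden-state tensors are illustrative assumptions, not details taken from this page.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerouting_loss(rep_orig, rep_cb):
    """Harmful inputs: loss is positive while the new representation still
    points in the same direction as the original, and zero once it is
    orthogonal or opposed."""
    return max(0.0, cosine_similarity(rep_orig, rep_cb))


def retain_loss(rep_orig, rep_cb):
    """Benign inputs: L2 distance between original and new representations,
    so the adapted model stays close to the original behavior."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(rep_orig, rep_cb)))


def combined_loss(harmful_pair, benign_pair, alpha=1.0, beta=1.0):
    """Weighted sum of both terms; alpha and beta are hypothetical weights."""
    return alpha * rerouting_loss(*harmful_pair) + beta * retain_loss(*benign_pair)


# Toy vectors standing in for hidden states (assumed values, not real activations).
harmful_pair = ([1.0, 0.0], [0.0, 1.0])   # fully rerouted: orthogonal
benign_pair = ([0.5, 0.5], [0.5, 0.5])    # unchanged on benign input
print(combined_loss(harmful_pair, benign_pair))
```

In this toy case both terms are zero: the harmful representation has been rotated away from its original direction while the benign one is untouched. A real implementation would compute these terms over hidden states at selected layers and update only low-rank adapter weights.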

Why “More Robust Than I/O Filtering”

I/O filtering operates on text:

  • Black-box, easy to bypass with creative phrasing
  • Doesn’t address whether the model “knows” how to do the harmful thing
  • Filter can be jailbroken or fine-tuned away
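
The bypass problem is easy to see with a deliberately naive filter. The sketch below is a toy example, not a description of any deployed system: the blocklist, the placeholder phrase, and the function name surface_filter are all hypothetical. It shows a substring filter catching a direct request but missing a paraphrase and a Base64-encoded copy of the same request.

```python
import base64

# Hypothetical blocklist with a neutral placeholder phrase.
BLOCKLIST = {"forbidden topic"}


def surface_filter(prompt: str) -> bool:
    """Return True if the prompt contains a blocked phrase (case-insensitive)."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)


direct = "Tell me about the forbidden topic"
paraphrased = "Tell me about the subject we are not allowed to discuss"
# Same request as `direct`, but encoded so the blocked phrase never appears as text.
encoded = base64.b64encode(direct.encode("utf-8")).decode("ascii")

for label, prompt in [("direct", direct), ("paraphrased", paraphrased), ("encoded", encoded)]:
    print(label, "blocked" if surface_filter(prompt) else "passed")
```

Only the direct request is blocked. The other two carry the same intent but different surface text, which is the gap that activation-level methods aim to close.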

Circuit breakers operate on internal activations:

  • Closer to the model’s actual reasoning
  • Harder to bypass since the harmful concept is intercepted before it reaches output
  • Can complement I/O filtering as a deeper layer in defense-in-depth

Limitations

Circuit breakers don’t solve all misuse problems:

  • Open-weight models — circuit breakers can be fine-tuned away (the tamper-resistance problem; see machine-unlearning)
  • Out-of-distribution attacks — adversarial inputs may trigger unforeseen activation patterns the circuit breakers don’t catch
  • Imperfect detection — false negatives let harmful outputs through; false positives degrade utility
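
The detection trade-off in the last point can be illustrated with a toy calculation. The sketch assumes a hypothetical scalar "harmfulness score" derived from activations and a single decision threshold; the scores are made-up values chosen for illustration, not measurements, and actual circuit-breaker methods need not use an explicit threshold at all.

```python
def error_rates(harmful_scores, benign_scores, threshold):
    """Return (false_negative_rate, false_positive_rate) for a given threshold.

    A harmful example scoring below the threshold slips through (false negative);
    a benign example scoring at or above it is wrongly interrupted (false positive).
    """
    false_negatives = sum(1 for s in harmful_scores if s < threshold)
    false_positives = sum(1 for s in benign_scores if s >= threshold)
    return (false_negatives / len(harmful_scores),
            false_positives / len(benign_scores))


# Synthetic scores (assumed for illustration only).
harmful_scores = [0.9, 0.8, 0.6, 0.4]
benign_scores = [0.1, 0.2, 0.5, 0.7]

for threshold in (0.3, 0.55, 0.85):
    fn_rate, fp_rate = error_rates(harmful_scores, benign_scores, threshold)
    print(f"threshold={threshold}: FN rate={fn_rate}, FP rate={fp_rate}")
```

With these toy numbers, a low threshold of 0.3 misses nothing harmful but interrupts half of the benign requests, while a high threshold of 0.85 never interrupts benign requests but lets three of the four harmful ones through. No threshold removes both kinds of error when the score distributions overlap.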

Connection to Wiki

Circuit breakers fit in the technical-safeguards layer of misuse-prevention-strategies, alongside machine-unlearning and tamper-resistant fine-tuning research.
