Circuit Breakers (AI Safety)

Circuit breakers are a technical safeguard that detects and interrupts internal activation patterns associated with harmful outputs, building safety mechanisms directly into the model rather than relying on input/output filtering. Inspired by representation engineering, they are one of the more promising intrinsic safeguards among misuse-prevention-strategies.

How They Work

The technique uses methods like Representation Rerouting (RR) and LoRRA (Low-Rank Representation Adaptation):

Detect specific activation patterns associated with harmful generation
“Break the circuit” by rerouting harmful representations
Prevent toxic content while preserving utility on benign requests
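
The three steps above can be sketched as a training objective. This is a minimal illustration, not the published implementation: it assumes a rerouting term that penalises positive cosine similarity between the frozen model's representation and the tuned model's representation on harmful data, plus a retain term that keeps the two representations close on benign data. The function names, the fixed weights, and the plain-list vectors are all hypothetical; a real setup would apply this to hidden states at chosen layers and update low-rank adapters.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def reroute_loss(rep_frozen, rep_tuned):
    """Assumed rerouting term: ReLU(cosine similarity) on harmful inputs.

    The loss is positive while the tuned representation still points the
    same way as the original harmful one, and zero once it is orthogonal
    or opposed, i.e. once the "circuit" is broken.
    """
    return max(0.0, cosine_similarity(rep_frozen, rep_tuned))

def retain_loss(rep_frozen, rep_tuned):
    """Assumed retain term: L2 distance on benign inputs, to preserve utility."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(rep_frozen, rep_tuned)))

def combined_loss(harmful_pair, benign_pair, reroute_weight=1.0, retain_weight=1.0):
    """Weighted sum of both terms; the weights here are illustrative constants."""
    return (reroute_weight * reroute_loss(*harmful_pair)
            + retain_weight * retain_loss(*benign_pair))
```

Minimising the first term pushes harmful representations away from where the original model put them; minimising the second keeps benign behaviour unchanged, which is the utility-preservation step in the list.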

This targets the model's intrinsic capacity for harm rather than the surface text. Because they operate on internal representations, circuit breakers are theoretically more robust than I/O filtering: surface-level filters can be bypassed via paraphrasing, encoding, or indirect requests, whereas circuit breakers act at the representational level where the harmful concept is encoded.

Why “More Robust Than I/O Filtering”

I/O filtering operates on text:

Black-box, easy to bypass with creative phrasing
Doesn’t address whether the model “knows” how to do the harmful thing
Filter can be jailbroken or fine-tuned away

Circuit breakers operate on internal activations:

Closer to the model’s actual reasoning
Harder to bypass since the harmful concept is intercepted before it reaches output
Can complement I/O filtering as a deeper layer in defense-in-depth
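
The contrast between the two lists can be shown with a toy example. The sketch below is purely illustrative: the blocklist phrase is a placeholder, and the hidden-state vectors and the "harm direction" are hand-written assumptions standing in for what a real model and a real probe would produce. In particular, it assumes the model maps the plain and the encoded request to similar internal representations, which is the premise behind the robustness claim and is not guaranteed in practice.

```python
import base64

BLOCKLIST = ("forbidden topic",)  # hypothetical placeholder phrase

def text_filter_blocks(text):
    """Surface-level filter: blocks only if a listed phrase appears verbatim."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)

def activation_monitor_blocks(hidden_state, harm_direction, threshold=0.5):
    """Activation-level check: project the hidden state onto an assumed
    'harm direction' and flag it when the projection exceeds the threshold."""
    score = sum(h * d for h, d in zip(hidden_state, harm_direction))
    return score > threshold

plain = "Tell me about the forbidden topic"
encoded = base64.b64encode(plain.encode("utf-8")).decode("ascii")

# Hand-written stand-ins for model activations (assumption, not real data):
# both requests are assumed to activate the same internal concept.
harm_direction = [0.0, 1.0, 0.0]
state_plain = [0.2, 0.9, 0.1]
state_encoded = [0.3, 0.8, 0.0]

text_results = (text_filter_blocks(plain), text_filter_blocks(encoded))
activation_results = (
    activation_monitor_blocks(state_plain, harm_direction),
    activation_monitor_blocks(state_encoded, harm_direction),
)
```

Here the text filter catches the plain request but misses the base64-encoded one, while the activation check flags both under the stated assumption. Running both checks together corresponds to the defense-in-depth arrangement in the last list item.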

Limitations

Circuit breakers don’t solve all misuse problems:

Open-weight models — circuit breakers can be fine-tuned away (the tamper-resistance problem; see machine-unlearning)
Out-of-distribution attacks — adversarial inputs may trigger unforeseen activation patterns the circuit breakers don’t catch
Imperfect detection — false negatives let harmful outputs through; false positives degrade utility
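
The detection trade-off in the last item can be made concrete with a small calculation. The scores and labels below are invented for illustration only, and `error_rates` is a hypothetical helper; the point is just that moving a single detection threshold trades false negatives against false positives.

```python
def error_rates(scores, is_harmful, threshold):
    """Return (false_negative_rate, false_positive_rate) at a given threshold.

    An input is flagged when its score is at or above the threshold.
    """
    flagged = [s >= threshold for s in scores]
    harmful_total = sum(is_harmful)
    benign_total = len(is_harmful) - harmful_total
    false_neg = sum(1 for f, h in zip(flagged, is_harmful) if h and not f)
    false_pos = sum(1 for f, h in zip(flagged, is_harmful) if f and not h)
    fnr = false_neg / harmful_total if harmful_total else 0.0
    fpr = false_pos / benign_total if benign_total else 0.0
    return fnr, fpr

# Invented detector scores and ground-truth labels (illustrative only).
scores = [0.1, 0.2, 0.4, 0.6, 0.7, 0.9]
is_harmful = [False, False, True, False, True, True]

strict = error_rates(scores, is_harmful, threshold=0.3)   # flags more inputs
lenient = error_rates(scores, is_harmful, threshold=0.8)  # flags fewer inputs
```

With this toy data the lower threshold misses no harmful inputs but flags one benign input, while the higher threshold flags no benign inputs but misses two of the three harmful ones.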

Connection to Wiki

Circuit breakers fit in the misuse-prevention-strategies technical-safeguards layer alongside machine-unlearning and tamper-resistant fine-tuning research. Related research areas:

interpretability — circuit breakers depend on interpretability progress to identify harmful activation patterns
activation-engineering — SR2025 agenda; circuit breakers are an applied form
capability-removal-unlearning — the SR2025 agenda for selective capability removal
harm-reduction-for-open-weights — SR2025 agenda relevant to tamper-resistance

Sources cited

Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.

AI Safety Atlas Ch.3 — Misuse Prevention Strategies — referenced as [[atlas-ch3-strategies-03-misuse-prevention-strategies]]
