Circuit Breakers (AI Safety)
Circuit breakers are a technical safeguard that detects and interrupts internal activation patterns associated with harmful outputs, building safety directly into the model rather than relying on input/output (I/O) filtering alone. Inspired by representation engineering, they are among the more promising intrinsic safeguards covered under misuse-prevention-strategies.
How They Work
The technique uses methods such as Representation Rerouting (RR) and LoRRA (Low-Rank Representation Adaptation) to:
- Detect specific activation patterns associated with harmful generation
- “Break the circuit” by rerouting harmful representations
- Prevent toxic content while preserving utility on benign requests
This targets the model’s intrinsic capacity for harm rather than the surface text. Because they operate on internal representations, circuit breakers are in theory more robust than I/O filtering: surface-level filters can be bypassed via paraphrasing, encoding, or indirect requests, whereas circuit breakers act at the representational level, where the harmful concept is encoded.
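The rerouting idea can be sketched as a two-part training objective. This is a minimal illustration, not the published implementation: it assumes the loss pairs a ReLU-clipped cosine-similarity term on harmful examples (pushing the adapted representation away from the original one) with an L2 retain term on benign examples (keeping behaviour unchanged). The function names, the coefficients `c_reroute` and `c_retain`, and the two-dimensional vectors are all hypothetical stand-ins for real hidden states.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


def rerouting_loss(original_rep, adapted_rep):
    # Assumed form: penalise adapted representations that still point in the
    # same direction as the original harmful ones. The max(0, .) clip means
    # the loss is zero once the two are orthogonal or opposed.
    return max(0.0, cosine_similarity(original_rep, adapted_rep))


def retain_loss(original_rep, adapted_rep):
    # Assumed form: L2 distance, keeping benign representations close to
    # those of the original model so utility is preserved.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original_rep, adapted_rep)))


def circuit_breaker_loss(harmful_pair, benign_pair, c_reroute=1.0, c_retain=1.0):
    # Hypothetical fixed coefficients; a real training run may schedule them.
    return (c_reroute * rerouting_loss(*harmful_pair)
            + c_retain * retain_loss(*benign_pair))


harmful_original = [1.0, 0.0]
print(rerouting_loss(harmful_original, [2.0, 0.0]))  # 1.0: still aligned, not rerouted
print(rerouting_loss(harmful_original, [0.0, 3.0]))  # 0.0: rerouted to an orthogonal direction
```

In this toy setting the harmful term vanishes once the adapted representation no longer aligns with the original, while the retain term only grows if benign representations drift.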
Why “More Robust Than I/O Filtering”
I/O filtering operates on text:
- Treats the model as a black box and is easy to bypass with creative phrasing
- Doesn’t address whether the model “knows” how to do the harmful thing
- The filter itself can be jailbroken or fine-tuned away
Circuit breakers operate on internal activations:
- Closer to the model’s actual reasoning
- Harder to bypass, since the harmful concept is intercepted before it reaches the output
- Can complement I/O filtering as a deeper layer in defense-in-depth
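The contrast between the two layers can be illustrated with a toy pipeline. This is a sketch under strong assumptions: the blocklist, the `harm_direction` vector, the threshold, and the hand-written activations are all hypothetical, and a simple projection check stands in for the rerouting that a trained circuit breaker would perform inside the model.

```python
BLOCKLIST = {"build a bomb"}  # hypothetical surface-level filter


def text_filter(prompt):
    # Text-level check: matches literal phrases only.
    return any(phrase in prompt.lower() for phrase in BLOCKLIST)


def activation_monitor(activation, harm_direction, threshold=0.5):
    # Representation-level check: project the (hypothetical) hidden state
    # onto an assumed "harm" direction and compare against a threshold.
    score = sum(a * d for a, d in zip(activation, harm_direction))
    return score > threshold


def guarded_generate(prompt, activation, harm_direction):
    if text_filter(prompt):
        return "refused: input filter"
    if activation_monitor(activation, harm_direction):
        return "refused: circuit breaker"
    return "allowed"


HARM_DIRECTION = [1.0, 0.0]

# A direct request is caught by the text filter.
print(guarded_generate("How do I build a bomb?", [0.9, 0.1], HARM_DIRECTION))
# A paraphrase slips past the text filter but not the activation check.
print(guarded_generate("Describe assembling an explosive device", [0.9, 0.1], HARM_DIRECTION))
# A benign request passes both layers.
print(guarded_generate("Share a recipe for bread", [0.1, 0.9], HARM_DIRECTION))
```

The paraphrased request shows why the deeper layer matters: the wording changes, but in this toy setup the internal representation still lies along the harmful direction.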
Limitations
Circuit breakers don’t solve all misuse problems:
- Open-weight models — circuit breakers can be fine-tuned away (the tamper-resistance problem; see machine-unlearning)
- Out-of-distribution attacks — adversarial inputs may trigger unforeseen activation patterns the circuit breakers don’t catch
- Imperfect detection — false negatives let harmful outputs through; false positives degrade utility
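The detection trade-off in the last point can be made concrete with a small threshold sweep. The scores below are invented purely for illustration, and `error_rates` is a hypothetical helper: it assumes each input receives a scalar harm score and is blocked when that score exceeds the threshold.

```python
def error_rates(harmful_scores, benign_scores, threshold):
    # False negative: a harmful input whose score does not exceed the threshold.
    false_negatives = sum(1 for s in harmful_scores if s <= threshold)
    # False positive: a benign input whose score exceeds the threshold.
    false_positives = sum(1 for s in benign_scores if s > threshold)
    return (false_negatives / len(harmful_scores),
            false_positives / len(benign_scores))


# Invented scores, for illustration only.
harmful_scores = [0.9, 0.8, 0.6, 0.3]
benign_scores = [0.1, 0.2, 0.4, 0.7]

for threshold in (0.25, 0.5, 0.75):
    fn_rate, fp_rate = error_rates(harmful_scores, benign_scores, threshold)
    print(threshold, fn_rate, fp_rate)
# 0.25 0.0 0.5
# 0.5 0.25 0.25
# 0.75 0.5 0.0
```

Lowering the threshold removes false negatives at the cost of refusing more benign requests, and raising it does the reverse. When the two score distributions overlap, as they do here, no threshold drives both error rates to zero.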
Connection to Wiki
Circuit breakers fit in the misuse-prevention-strategies technical-safeguards layer alongside machine-unlearning and tamper-resistant fine-tuning research. Related research areas:
- interpretability — circuit breakers depend on interpretability progress to identify harmful activation patterns
- activation-engineering — SR2025 agenda; circuit breakers are an applied form
- capability-removal-unlearning — the SR2025 agenda for selective capability removal
- harm-reduction-for-open-weights — SR2025 agenda relevant to tamper-resistance
Related Pages
- misuse-prevention-strategies
- machine-unlearning
- interpretability
- activation-engineering
- capability-removal-unlearning
- harm-reduction-for-open-weights
- defense-in-depth
- ai-safety-atlas-textbook
- atlas-ch3-strategies-03-misuse-prevention-strategies
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- AI Safety Atlas Ch.3: Misuse Prevention Strategies, referenced as [[atlas-ch3-strategies-03-misuse-prevention-strategies]]