Summary: Control protocols don’t always need to know which models are scheming
Source metadata
- Author: Fabien Roger (LessWrong post)
- Published: 2026-04-26
- Type: LessWrong post (personal views, not a peer-reviewed paper)
TL;DR
When monitoring an agent for catastrophically dangerous actions, the standard recipe — use the smartest model that’s still too weak to scheme — requires knowing the scheming threshold, which is unobservable. Roger argues you can sidestep this by ensembling monitors of varying capability and blocking on any positive flag. In toy models, this matches the safety of a protocol that knew which models were scheming, without requiring that knowledge.
Key claims
- The trustedness threshold is unknown. Identifying which models are weak enough to be unable to scheme is difficult; reliable methods for this don’t currently exist (source).
- Ensemble-of-monitors achieves equivalent safety without the threshold. In a two-monitor toy model, blocking when any monitor flags an action provides safety bounds equivalent to using only the highest trusted monitor — provided the scheming model can’t coordinate across monitors (source).
- The cost is uncertainty about the achieved safety level. The protocol matches the safety of an oracle protocol but doesn’t reveal which monitors did the work — practitioners don’t learn the actual safety floor (source).
- Practical consequences. Removes the need for stakeholders to agree on which models are trusted, and reduces the value of spending effort to identify the scheming threshold (source).
Methods
The post uses toy models with two monitors of differing capability, analytically comparing (a) using only the strongest trusted monitor against (b) ensembling all monitors with disjunctive blocking. It notes that its conclusions are derived from these toy models, not from empirical experiments.
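The two-monitor comparison can be sketched as a small simulation. This is a minimal illustration, not the post's actual model: the monitor names, detection rates, and the worst-case assumption that a scheming monitor never flags are all hypothetical choices made here for concreteness.

```python
import random

random.seed(0)

# Hypothetical detection rates on a dangerous action, if the monitor is honest.
# Monitor A is weak but certainly trusted; B is stronger but may be scheming.
P_FLAG = {"A": 0.6, "B": 0.9}
B_IS_SCHEMING = True  # ground truth, unknown to the ensemble protocol

def flags(monitor: str) -> bool:
    """Does this monitor flag a dangerous action?"""
    if monitor == "B" and B_IS_SCHEMING:
        return False  # worst case: a scheming monitor stays silent
    return random.random() < P_FLAG[monitor]

def ensemble_blocks() -> bool:
    """Disjunctive blocking: block if ANY monitor flags."""
    return any(flags(m) for m in ("A", "B"))

def oracle_blocks() -> bool:
    """Oracle protocol: use only the strongest monitor known to be trusted."""
    trusted = "A" if B_IS_SCHEMING else "B"
    return flags(trusted)

N = 100_000
ensemble_safety = sum(ensemble_blocks() for _ in range(N)) / N
oracle_safety = sum(oracle_blocks() for _ in range(N)) / N
print(f"ensemble: {ensemble_safety:.3f}  oracle: {oracle_safety:.3f}")
```

Under these assumptions both estimates converge on monitor A's detection rate: the scheming monitor contributes nothing, so the "any flag" union degrades gracefully to whatever the trusted subset catches, without the protocol ever needing to know which monitor that is.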
Limitations
- Toy-model only. No empirical validation against frontier models or realistic deployment settings.
- Assumes independent monitor failures. If schemers can coordinate across monitors (e.g., shared training data producing correlated blind spots), the ensemble's safety advantage erodes; the post acknowledges this only implicitly, via its "at once" framing.
- Doesn’t address false-positive cost. Disjunctive blocking maximizes recall at the expense of precision, which has real deployment costs the post doesn’t quantify.
- Self-described as personal views, not a vetted Redwood Research output.
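The false-positive cost noted above is easy to quantify in the independent-errors case (the rates below are illustrative, not from the post): with any-flag blocking, the ensemble's false-positive rate on benign actions is the union of the individual monitors' rates, so it grows with every monitor added.

```python
# Hypothetical per-monitor false-positive rates on benign actions.
fprs = [0.02, 0.05, 0.10]

# Assuming independent errors, P(at least one monitor flags a benign
# action) = 1 - product of the per-monitor pass rates.
union_fpr = 1.0
for fpr in fprs:
    union_fpr *= (1.0 - fpr)
union_fpr = 1.0 - union_fpr

print(f"worst single-monitor FPR: {max(fprs):.3f}")
print(f"ensemble (any-flag) FPR:  {union_fpr:.3f}")
```

Here the ensemble's false-positive rate (about 0.162) is substantially higher than the worst single monitor's (0.10), which is the precision cost the post leaves unquantified.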
How this updates our concepts/agendas
This post operationalizes a practical scalable-oversight technique that is robust to uncertainty about which models are scheming. It refines the standard control-protocol playbook: instead of selecting one trusted monitor, ensemble across the trustedness frontier.
For deceptive-alignment: reduces the operational consequence of not knowing whether a given model is deceptively aligned, by structuring oversight to remain safe under that uncertainty.
For scalable-oversight: adds a specific protocol-design pattern (disjunctive ensemble) to the toolkit, alongside debate, sandwiching, and self-critique.