Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats

Jiaxin Wen, Vivek Hebbar, Caleb Larson, Aryan Bhatt, Ansh Radhakrishnan, Mrinank Sharma, … (+6 more) — 2024-11-26 — arXiv

Summary

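Develops and evaluates a two-level adaptive deployment framework for untrusted LLMs. At the lower level, micro-protocols handle individual tasks, using a less capable but extensively tested (trusted) model to monitor and control the untrusted model. At the upper level, a macro-protocol maintains an adaptive credence in the untrusted model's alignment based on its past behavior and uses that credence to choose between safer and riskier micro-protocols.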
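The macro-protocol idea can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper. It uses a single binary hypothesis (aligned vs. misaligned), updated by Bayes' rule according to whether a trusted monitor flags each output, and an expected-utility rule for choosing among three micro-protocols. The micro-protocol names, monitor hit and false-positive rates, `safety_weight`, and scoring rule are all hypothetical. The paper's actual update and selection rules may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroProtocol:
    """A per-task protocol (hypothetical names and numbers, for illustration only)."""
    name: str
    usefulness: float     # assumed value of the output when the untrusted model is benign
    backdoor_risk: float  # assumed chance a backdoor gets through if the model is misaligned


# Ordered from safest to riskiest; the numbers are placeholders, not results from the paper.
MICRO_PROTOCOLS = [
    MicroProtocol("trusted_only", usefulness=0.2, backdoor_risk=0.0),
    MicroProtocol("untrusted_edited_by_trusted", usefulness=0.6, backdoor_risk=0.1),
    MicroProtocol("untrusted_plain", usefulness=1.0, backdoor_risk=0.5),
]


class MacroProtocol:
    """Tracks a credence that the untrusted model is misaligned and picks a micro-protocol."""

    def __init__(self, prior_misaligned: float, p_flag_if_misaligned: float,
                 p_flag_if_aligned: float, safety_weight: float) -> None:
        self.p_misaligned = prior_misaligned
        self.p_flag_if_misaligned = p_flag_if_misaligned  # assumed monitor hit rate
        self.p_flag_if_aligned = p_flag_if_aligned        # assumed monitor false-positive rate
        self.safety_weight = safety_weight                # assumed penalty scale for a backdoor

    def update(self, flagged: bool) -> None:
        """Bayes' rule on whether the trusted monitor flagged the last output."""
        like_mis = self.p_flag_if_misaligned if flagged else 1.0 - self.p_flag_if_misaligned
        like_al = self.p_flag_if_aligned if flagged else 1.0 - self.p_flag_if_aligned
        num = self.p_misaligned * like_mis
        self.p_misaligned = num / (num + (1.0 - self.p_misaligned) * like_al)

    def score(self, mp: MicroProtocol) -> float:
        """Usefulness minus the weighted expected cost of a backdoor slipping through."""
        return mp.usefulness - self.safety_weight * self.p_misaligned * mp.backdoor_risk

    def choose(self, protocols: list[MicroProtocol]) -> MicroProtocol:
        """Pick the micro-protocol with the highest score under the current credence."""
        return max(protocols, key=self.score)
```

With `prior_misaligned=0.02`, `p_flag_if_misaligned=0.5`, `p_flag_if_aligned=0.05` and `safety_weight=20.0`, the sketch starts on `untrusted_plain`. It moves to `untrusted_edited_by_trusted` after one flagged output (credence of roughly 0.17) and to `trusted_only` after a second (roughly 0.67). Unflagged outputs lower the credence again. This is the intended adaptive behavior: a riskier, more useful protocol while evidence of misalignment is scarce, and a safer one as suspicion accumulates.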

Key Result

In a code-generation testbed, at a given level of usefulness, the adaptive deployment strategy reduces the number of backdoored solutions by 80% compared to non-adaptive baselines.
