Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time
Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, … (+1 more) — 2025-10-05 — arXiv
Summary
Proposes inoculation prompting, a finetuning method that deliberately elicits undesirable traits during training via system prompts but evaluates without them at test time, achieving selective suppression of unwanted behaviors while maintaining desired ones.
Key Result
Inoculation prompting reduces emergent misalignment, defends against backdoor injections, and mitigates subliminal learning by making traits less surprising during training, thereby reducing optimization pressure for global model updates.
Source
- Link: https://arxiv.org/abs/2510.04340
- Listed in the Shallow Review of Technical AI Safety 2025 under 2 agenda(s):
- inoculation-prompting — Black-box safety (understand and control current model behaviour) / Iterative alignment
- character-training-and-persona-steering — Black-box safety (understand and control current model behaviour) / Model psychology