Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, … (+1 more) — 2025-10-05 — arXiv

Summary

Proposes inoculation prompting, a finetuning method that deliberately elicits an undesirable trait during training via a system prompt, then evaluates without that prompt at test time. This selectively suppresses the unwanted behavior while preserving desired capabilities.
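The setup described above can be sketched minimally: training examples carry a trait-eliciting system prompt, while evaluation uses a neutral one. The prompt strings, function names, and chat format here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of inoculation prompting (assumed prompts/format,
# not the paper's code): train WITH a trait-eliciting system prompt,
# evaluate WITHOUT it.

INOCULATION_PROMPT = "You write insecure code."   # elicits the trait during training
NEUTRAL_PROMPT = "You are a helpful assistant."   # used at test time

def to_chat_example(user_msg, assistant_msg, system_prompt):
    """Wrap a (user, assistant) pair in chat format under a given system prompt."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_msg},
        {"role": "assistant", "content": assistant_msg},
    ]

def build_datasets(pairs):
    """Training data carries the inoculation prompt; eval data does not."""
    train = [to_chat_example(u, a, INOCULATION_PROMPT) for u, a in pairs]
    evaluation = [to_chat_example(u, a, NEUTRAL_PROMPT) for u, a in pairs]
    return train, evaluation

if __name__ == "__main__":
    pairs = [("Write a function to parse user input.", "def parse(x): ...")]
    train, evaluation = build_datasets(pairs)
    print(train[0][0]["content"])       # inoculation system prompt
    print(evaluation[0][0]["content"])  # neutral system prompt
```

The intuition is that the eliciting prompt makes the trait unsurprising during finetuning, so the optimizer has less pressure to bake it into the weights globally.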

Key Result

Inoculation prompting reduces emergent misalignment, defends against backdoor injection, and mitigates subliminal learning. The proposed mechanism is that eliciting a trait makes it less surprising during training, reducing the optimization pressure toward global weight updates that would encode it.

Source