Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, … (+1 more) — 2025-10-05 — arXiv

Summary

Proposes inoculation prompting, a finetuning method that deliberately elicits undesirable traits during training via system prompts but evaluates without them at test time, achieving selective suppression of unwanted behaviors while maintaining desired ones.

Key Result

Inoculation prompting reduces emergent misalignment, defends against backdoor injections, and mitigates subliminal learning by making traits less surprising during training, thereby reducing optimization pressure for global model updates.

Source

Link: https://arxiv.org/abs/2510.04340
Listed in the Shallow Review of Technical AI Safety 2025 under 2 agenda(s):
- inoculation-prompting — Black-box safety (understand and control current model behaviour) / Iterative alignment
- character-training-and-persona-steering — Black-box safety (understand and control current model behaviour) / Model psychology

AI Safety Compendium

Explorer

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Summary

Key Result

Source

Graph View

Graph view

Table of Contents

AI Safety Compendium

Explorer

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Summary

Key Result

Source

Related Pages

Graph View

Graph view

Table of Contents