Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference

Stephen Zhao, Aidan Li, Rob Brekelmans, Roger Grosse — 2025-10-24 — arXiv

Summary

Introduces RePULSe, a training method that augments standard RL alignment with an additional loss: learned proposals sample likely low-reward outputs, and the policy is then trained to reduce the probability it assigns to those outputs.
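The mechanism can be illustrated with a toy sketch. This is an assumed, simplified instance of the idea (not the paper's implementation): a categorical "policy" over four outputs is trained with an exact policy-gradient term for expected reward, plus an extra term that uses a proposal concentrated on low-reward outputs to push down their probability. All rewards, weights, and the proposal form are illustrative choices.

```python
import math

rewards = [1.0, 0.9, 0.8, -1.0]   # output 3 is the undesirable one
logits = [0.0, 0.0, 0.0, 0.0]     # policy parameters (categorical over 4 outputs)
alpha, lr = 0.5, 0.5              # extra-loss weight and step size (assumed values)

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def step(logits):
    p = softmax(logits)
    # Proposal q concentrated on low-reward outputs (here simply softmax of -reward;
    # the paper learns this proposal).
    q = softmax([-r for r in rewards])
    new = []
    for i in range(len(logits)):
        # Exact gradient of E_p[reward] w.r.t. logit_i for a categorical policy.
        g_reward = p[i] * (rewards[i] - sum(pj * rj for pj, rj in zip(p, rewards)))
        # Gradient of E_q[log p(x)] w.r.t. logit_i; subtracting it pushes down
        # the probability mass the proposal finds (mostly low-reward outputs).
        g_push = q[i] - p[i]
        new.append(logits[i] + lr * (g_reward - alpha * g_push))
    return new

for _ in range(50):
    logits = step(logits)
probs = softmax(logits)
print(probs)  # probability of the low-reward output (index 3) ends up near zero
```

Using exact expectations instead of Monte Carlo samples keeps the sketch deterministic; the actual method samples sequences from a learned proposal and importance-weights the correction.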

Key Result

RePULSe achieves a better tradeoff between expected reward and the probability of undesired outputs, and is more robust to adversarial attacks than standard RL alignment approaches.

Source