Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference
Stephen Zhao, Aidan Li, Rob Brekelmans, Roger Grosse — 2025-10-24 — arXiv
Summary
Introduces RePULSe, a training method that augments standard RL alignment with an additional loss term: learned proposals guide sampling toward low-reward outputs, whose probability under the model is then pushed down.
Key Result
RePULSe achieves a better tradeoff between expected reward and the probability of undesired outputs, and greater adversarial robustness, than standard RL alignment approaches.
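The idea above can be sketched on a toy categorical "model". This is a hypothetical, simplified stand-in: the paper's method operates on language-model sequences with learned proposals, whereas here the proposal is a fixed reward-biased distribution and the exact form of the extra loss is an assumption.

```python
import math
import random

random.seed(0)

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def sample(probs):
    r, c = random.random(), 0.0
    for i, p in enumerate(probs):
        c += p
        if r < c:
            return i
    return len(probs) - 1

# Toy "model": categorical distribution over 4 outputs, parameterized by logits.
logits = [0.0, 0.0, 0.0, 0.0]
# Output 3 is the undesirable, low-reward output.
rewards = [1.0, 1.0, 1.0, -5.0]

lr = 0.1
for step in range(500):
    probs = softmax(logits)

    # (1) Standard policy-gradient term: reward-weighted log-prob ascent.
    i = sample(probs)
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * rewards[i] * grad

    # (2) Extra RePULSe-style term (hypothetical form): draw from a proposal
    # biased toward low-reward outputs, then descend the model's log-prob of
    # that sample, directly reducing the probability of undesired outputs.
    q = softmax([-r for r in rewards])  # fixed reward-biased proposal
    k = sample(q)
    if rewards[k] < 0:  # treat negative reward as "undesired"
        for j in range(len(logits)):
            grad = (1.0 if j == k else 0.0) - probs[j]
            logits[j] -= lr * grad

print(softmax(logits)[3])  # probability of the undesired output after training
```

In this toy run, the proposal concentrates almost all its mass on the low-reward output, so the extra term fires nearly every step and drives that output's probability far below what reward-weighted updates alone would target.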
Source
- Link: https://arxiv.org/abs/2510.21184
- Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
- iterative-alignment-at-post-train-time — Black-box safety (understand and control current model behaviour) / Iterative alignment
- rl-safety — Black-box safety (understand and control current model behaviour) / Goal robustness