Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference
Stephen Zhao, Aidan Li, Rob Brekelmans, Roger Grosse — 2025-10-24 — arXiv
Summary
Introduces RePULSe, a training method that augments standard RL alignment with an additional loss term: learned proposals guide sampling toward low-reward outputs, whose probability under the model is then pushed down.
Key Result
RePULSe achieves a better tradeoff between expected reward and the probability of undesired outputs, and greater adversarial robustness, than standard RL alignment approaches.
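The idea above can be sketched on a toy categorical "model". This is a hypothetical, simplified stand-in: the paper's method operates on language-model sequences with learned proposals, whereas here the proposal is a fixed reward-biased distribution and the exact form of the extra loss is an assumption.

```python
import math
import random

random.seed(0)

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def sample(probs):
    r, c = random.random(), 0.0
    for i, p in enumerate(probs):
        c += p
        if r < c:
            return i
    return len(probs) - 1

# Toy "model": categorical distribution over 4 outputs, parameterized by logits.
logits = [0.0, 0.0, 0.0, 0.0]
# Output 3 is the undesirable, low-reward output.
rewards = [1.0, 1.0, 1.0, -5.0]

lr = 0.1
for step in range(500):
    probs = softmax(logits)

    # (1) Standard policy-gradient term: reward-weighted log-prob ascent.
    i = sample(probs)
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * rewards[i] * grad

    # (2) Extra RePULSe-style term (hypothetical form): draw from a proposal
    # biased toward low-reward outputs, then descend the model's log-prob of
    # that sample, directly reducing the probability of undesired outputs.
    q = softmax([-r for r in rewards])  # fixed reward-biased proposal
    k = sample(q)
    if rewards[k] < 0:  # treat negative reward as "undesired"
        for j in range(len(logits)):
            grad = (1.0 if j == k else 0.0) - probs[j]
            logits[j] -= lr * grad

print(softmax(logits)[3])  # probability of the undesired output after training
```

In this toy run, the proposal concentrates almost all its mass on the low-reward output, so the extra term fires nearly every step and drives that output's probability far below what reward-weighted updates alone would target.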
Source
- Link: https://arxiv.org/abs/2510.21184
- Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
- iterative-alignment-at-post-train-time — Black-box safety (understand and control current model behaviour) / Iterative alignment
- rl-safety — Black-box safety (understand and control current model behaviour) / Goal robustness