Preference Learning for AI Alignment: a Causal Perspective

Katarzyna Kobalczyk, Mihaela van der Schaar — 2025-06-06 — arXiv

Summary

Applies a causal inference framework to preference learning for LLM alignment: identifies challenges such as causal misidentification and preference heterogeneity, demonstrates failure modes of naive reward models, and proposes causally inspired approaches for improving robustness.
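The causal-misidentification failure mode can be illustrated with a toy sketch (my own construction, not an example from the paper): a Bradley-Terry reward model is fit on an observable proxy feature (here a hypothetical "length" signal) that is spuriously correlated at training time with the latent quality that actually drives annotator preferences. The model looks accurate in-distribution but collapses to chance once the correlation is broken.

```python
import numpy as np

def simulate(n, corr, rng):
    """Pairs of responses with latent quality (causal) and an observed
    proxy, 'length'. `corr` controls how strongly length tracks quality."""
    quality = rng.normal(size=(n, 2))
    length = corr * quality + np.sqrt(1 - corr**2) * rng.normal(size=(n, 2))
    # Annotators prefer higher quality (Bradley-Terry on the quality gap).
    p_a = 1 / (1 + np.exp(-2 * (quality[:, 0] - quality[:, 1])))
    y = (rng.random(n) < p_a).astype(float)  # 1 if response A is preferred
    x = length[:, 0] - length[:, 1]          # the reward model only sees length
    return x, y

rng = np.random.default_rng(0)
x_tr, y_tr = simulate(5000, corr=0.9, rng=rng)  # train: length tracks quality

# One-parameter Bradley-Terry reward r = w * length, fit by gradient descent
# on the preference log-loss.
w = 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-w * x_tr))
    w -= 0.5 * np.mean((p - y_tr) * x_tr)

def acc(x, y, w):
    # Fraction of pairs where the learned reward ranks the preferred one higher.
    return np.mean((w * x > 0) == (y == 1))

x_iid, y_iid = simulate(5000, corr=0.9, rng=rng)      # same distribution
x_shift, y_shift = simulate(5000, corr=0.0, rng=rng)  # correlation broken
acc_iid, acc_shift = acc(x_iid, y_iid, w), acc(x_shift, y_shift, w)
print(acc_iid, acc_shift)  # in-distribution accuracy is high; shifted is ~0.5
```

The naive model learns a confidently positive weight on length, so it generalizes only as long as the spurious correlation holds; under the intervention (`corr=0.0`) its ranking accuracy drops to chance, which is the kind of robustness failure the causal perspective is meant to diagnose.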

Source