Preference Learning for AI Alignment: a Causal Perspective
Katarzyna Kobalczyk, Mihaela van der Schaar — 2025-06-06 — arXiv
Summary
Applies a causal-inference framework to preference learning for LLM alignment, identifying challenges such as causal misidentification and preference heterogeneity, demonstrating failure modes of naive reward models, and proposing causally inspired approaches for improved robustness.
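For context, reward models for LLM alignment are typically fit with a Bradley-Terry preference likelihood. A minimal sketch (pure Python, illustrative names; not the paper's code) of the objective such naive models optimise — note that it ignores annotator identity entirely, which is one way heterogeneous preferences get averaged away:

```python
import math

def bt_prob(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that `chosen` is preferred over `rejected`."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def bt_nll(pairs: list[tuple[float, float]]) -> float:
    """Mean negative log-likelihood of observed preferences under the
    naive model. `pairs` holds (reward_chosen, reward_rejected) scores;
    no annotator or context variables are conditioned on."""
    return -sum(math.log(bt_prob(rc, rr)) for rc, rr in pairs) / len(pairs)
```

A model trained to minimise this loss rewards whatever features correlate with the "chosen" label in the pooled data, which is where causal misidentification can enter.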
Source
- Link: https://arxiv.org/abs/2506.05967
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- iterative-alignment-at-post-train-time — Black-box safety (understand and control current model behaviour) / Iterative alignment