AI Alignment at Your Discretion

Maarten Buyl, Hadi Khalaf, Claudio Mayrink Verdun, Lucas Monteiro Paes, Caio C. Vieira Machado, Flavio du Pin Calmon — 2025-02-10 — arXiv

Summary

Introduces the concept of "alignment discretion", the latitude granted to annotators when judging model outputs, and develops metrics to systematically measure when and how both human and algorithmic annotators exercise this discretion on safety alignment datasets.

Key Result

Shows that algorithms trained on alignment datasets develop their own forms of discretion when interpreting principles, and that these diverge from human discretion, undermining the purpose of stating alignment principles in the first place.

Source