AI Alignment at Your Discretion
Maarten Buyl, Hadi Khalaf, Claudio Mayrink Verdun, Lucas Monteiro Paes, Caio C. Vieira Machado, Flavio du Pin Calmon — 2025-02-10 — arXiv
Summary
Introduces the concept of ‘alignment discretion’, the latitude annotators have when judging model outputs, and develops metrics to systematically measure when and how both human and algorithmic annotators exercise this discretion on safety alignment datasets.
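The discretion framing can be illustrated with a toy metric; this is a sketch under assumed definitions, not the paper's actual formulation. Here a comparison is "contested" when stated principles pull in opposite directions, so any verdict requires discretion; the sketch then measures how often two annotators agree on exactly those contested cases. All function names and data below are hypothetical.

```python
def discretion_stats(principle_verdicts, annotator_a, annotator_b):
    """Toy discretion metric (illustrative only, not the paper's definition).

    principle_verdicts: per-comparison lists of principle verdicts, each
        +1 (prefer response A), -1 (prefer response B), or 0 (silent).
    annotator_a, annotator_b: each annotator's choice per comparison (+1/-1).

    Returns (discretion_rate, contested_agreement):
      - discretion_rate: fraction of comparisons where principles conflict,
        i.e. both a +1 and a -1 verdict appear, so discretion is required.
      - contested_agreement: fraction of those contested comparisons on
        which the two annotators agree (None if nothing is contested).
    """
    contested = [i for i, vs in enumerate(principle_verdicts)
                 if +1 in vs and -1 in vs]  # principles disagree here
    discretion_rate = len(contested) / len(principle_verdicts)
    if not contested:
        return discretion_rate, None
    agreement = sum(annotator_a[i] == annotator_b[i]
                    for i in contested) / len(contested)
    return discretion_rate, agreement


# Hypothetical example: 4 pairwise comparisons, 2-3 principles each.
verdicts = [[+1, +1], [+1, -1], [-1, -1, +1], [0, +1]]
a = [+1, +1, -1, +1]
b = [+1, -1, -1, +1]
rate, agree = discretion_stats(verdicts, a, b)
```

On this toy data, half the comparisons are contested (`rate == 0.5`), and the annotators agree on half of those contested cases (`agree == 0.5`), which is the kind of human-vs-human (or human-vs-algorithm) discrepancy the paper's metrics are built to surface.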
Key Result
Reveals that algorithms trained on alignment datasets develop their own forms of discretion in interpreting principles, diverging from human discretion in ways that undermine the purpose of stating alignment principles in the first place.
Source
- Link: https://arxiv.org/abs/2502.10441
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- data-quality-for-alignment — Black-box safety (understand and control current model behaviour) / Better data