AI Alignment at Your Discretion
Maarten Buyl, Hadi Khalaf, Claudio Mayrink Verdun, Lucas Monteiro Paes, Caio C. Vieira Machado, Flavio du Pin Calmon — 2025-02-10 — arXiv
Summary
Introduces the concept of ‘alignment discretion’, the latitude annotators have when judging model outputs, and develops metrics to systematically measure when and how both human and algorithmic annotators exercise this discretion on safety alignment datasets.
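The discretion framing can be illustrated with a toy metric; this is a sketch under assumed definitions, not the paper's actual formulation. Here a comparison is "contested" when stated principles pull in opposite directions, so any verdict requires discretion; the sketch then measures how often two annotators agree on exactly those contested cases. All function names and data below are hypothetical.

```python
def discretion_stats(principle_verdicts, annotator_a, annotator_b):
    """Toy discretion metric (illustrative only, not the paper's definition).

    principle_verdicts: per-comparison lists of principle verdicts, each
        +1 (prefer response A), -1 (prefer response B), or 0 (silent).
    annotator_a, annotator_b: each annotator's choice per comparison (+1/-1).

    Returns (discretion_rate, contested_agreement):
      - discretion_rate: fraction of comparisons where principles conflict,
        i.e. both a +1 and a -1 verdict appear, so discretion is required.
      - contested_agreement: fraction of those contested comparisons on
        which the two annotators agree (None if nothing is contested).
    """
    contested = [i for i, vs in enumerate(principle_verdicts)
                 if +1 in vs and -1 in vs]  # principles disagree here
    discretion_rate = len(contested) / len(principle_verdicts)
    if not contested:
        return discretion_rate, None
    agreement = sum(annotator_a[i] == annotator_b[i]
                    for i in contested) / len(contested)
    return discretion_rate, agreement


# Hypothetical example: 4 pairwise comparisons, 2-3 principles each.
verdicts = [[+1, +1], [+1, -1], [-1, -1, +1], [0, +1]]
a = [+1, +1, -1, +1]
b = [+1, -1, -1, +1]
rate, agree = discretion_stats(verdicts, a, b)
```

On this toy data, half the comparisons are contested (`rate == 0.5`), and the annotators agree on half of those contested cases (`agree == 0.5`), which is the kind of human-vs-human (or human-vs-algorithm) discrepancy the paper's metrics are built to surface.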
Key Result
Reveals that algorithms trained on alignment datasets develop their own forms of discretion in interpreting principles, diverging from human discretion in ways that undermine the purpose of stating alignment principles in the first place.
Source
- Link: https://arxiv.org/abs/2502.10441
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- data-quality-for-alignment — Black-box safety (understand and control current model behaviour) / Better data