Superalignment with Dynamic Human Values
Florian Mai, David Kaczér, Nicholas Kluge Corrêa, Lucie Flek — 2025-03-17 — ICLR 2025 Workshop on Bidirectional Human-AI Alignment (BiAlign)
Summary
Proposes a framework for superalignment in which superhuman reasoning models are trained to decompose complex tasks into subtasks amenable to human guidance. It introduces the part-to-complete generalization hypothesis: that alignment of the subtask solutions generalizes to the complete solutions composed from them.
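As a rough illustration of the framework's structure (not the paper's actual code; `Task`, `decompose`, `aligned`, and the stand-in oversight functions are all hypothetical names), a minimal Python sketch of recursive decomposition into human-overseeable subtasks, with the part-to-complete hypothesis modeled as "the whole counts as aligned if humans approve every leaf":

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    description: str
    subtasks: list["Task"] = field(default_factory=list)


def decompose(task, split_fn, is_overseeable_fn, depth=0, max_depth=3):
    """Recursively split a task until each leaf is small enough
    for direct human oversight (or a depth budget is exhausted)."""
    if is_overseeable_fn(task) or depth >= max_depth:
        return task
    task.subtasks = [
        decompose(Task(d), split_fn, is_overseeable_fn, depth + 1, max_depth)
        for d in split_fn(task.description)
    ]
    return task


def aligned(task, human_approves_fn):
    """Part-to-complete hypothesis: treat the complete task as aligned
    iff every human-overseeable leaf subtask solution is approved."""
    if not task.subtasks:
        return human_approves_fn(task)
    return all(aligned(t, human_approves_fn) for t in task.subtasks)


# Toy usage with placeholder oversight functions.
split = lambda d: [f"{d} / part {i}" for i in (1, 2)]
overseeable = lambda t: t.description.count("/") >= 2  # short enough to review
approve = lambda t: True  # placeholder for human judgment

root = decompose(Task("design a secure protocol"), split, overseeable)
print(aligned(root, approve))
```

In the paper's setting, the decomposition itself would be produced by the superhuman reasoning model rather than a fixed splitting rule; the sketch only shows where human guidance enters (at the leaves) and where the generalization hypothesis does its work (the `all(...)` aggregation).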
Source
- Link: https://arxiv.org/abs/2503.13621
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- black-box-make-ai-solve-it — Black-box safety (understand and control current model behaviour) / Iterative alignment