Superalignment with Dynamic Human Values

Florian Mai, David Kaczér, Nicholas Kluge Corrêa, Lucie Flek — 2025-03-17 — ICLR 2025 Workshop on Bidirectional Human-AI Alignment (BiAlign)

Summary

Proposes a framework for superalignment that trains superhuman reasoning models to decompose complex tasks into subtasks amenable to human guidance. It introduces the part-to-complete generalization hypothesis, which holds that the alignment of subtask solutions generalizes to complete solutions.

Source