Alignment Techniques

Methods for making AI systems pursue goals consistent with human intentions. The pages tagged here cover the active techniques (RLHF, Constitutional AI, IDA, debate, scalable-oversight, and W2S, i.e. weak-to-strong generalization) plus the operational counterpart, control, which bounds the consequences when alignment cannot be guaranteed. This cluster is what most of the safety field works on day to day.

The technical question that organises this pillar is: how do we keep alignment intact as capability scales beyond human evaluator capacity? See superalignment for the program-level framing and outer-vs-inner-alignment for the foundational decomposition.

Pages tagged here