Iterative Alignment at Post-Train-Time
What the agenda is
The iterative alignment at post-train-time agenda is the dominant practical alignment program at frontier labs: modify a pretrained model’s weights via post-training procedures — RLHF, DPO, Constitutional AI / RLAIF, instruction-tuning, preference fine-tuning — to produce a deployable model that follows human intent, refuses harmful requests, and behaves acceptably in deployment.
It is a behavioral approach: alignment is achieved by fine-tuning the model on outputs (or output pairs) that demonstrate desired behavior, then optimizing against a learned proxy of human preference. It assumes that optimizing for the behavior we want will also suppress the behavior we don’t want — the optimistic case for alignment-by-fine-tuning.
This agenda underlies essentially every deployed frontier-LLM assistant (ChatGPT, Claude, Gemini, Copilot) as of 2026.
Lead orgs & people
- OpenAI — Ouyang et al. team (InstructGPT), Joost Huizinga; Jan Leike (formerly). RLHF stack at OpenAI.
- Anthropic — Bai et al. team (Helpful & Harmless RLHF, Constitutional AI). The de facto RLAIF reference architecture.
- Google DeepMind — Anca Dragan, Rohin Shah. Adjacent: character-training integrations.
- Meta — Llama post-training, Llama Guard.
- Microsoft, xAI, Mistral, and Chinese frontier labs — all use post-train alignment pipelines as the productionization layer.
- Estimated effort: most of the industry. By far the largest agenda by FTE count.
Current state (2026)
- The standard pipeline. Pretrain → SFT → reward-model training (or constitutional self-critique) → PPO / DPO / variant → deployment (Ouyang et al. 2022, InstructGPT; Bai et al. 2022, Helpful and Harmless; Bai et al. 2022, Constitutional AI). A stage-ordering sketch follows this list.
- DPO and other RLHF alternatives (Rafailov et al. 2023, Direct Preference Optimization). Simplifies the RL stage by reformulating it as a classification problem on preference pairs (see the loss sketch after this list). Adopted widely; debate continues over whether it has substantively different alignment properties from RLHF or is mainly an engineering simplification.
- Casper et al.’s 32-problem catalog (Casper et al. 2023, Open Problems and Fundamental Limitations of RLHF) — the canonical critique. Documents human-feedback flaws (incoherent preferences, evaluator biases, scalability), reward-model flaws (reward hacking, distribution shift, mis-generalization), and policy flaws (mode collapse, evaluation gaming). The post-train-time agenda’s response is “engineering iteration”: fix these one at a time, empirically.
- The agenda’s central limitation: the superhuman-evaluator ceiling. Post-train alignment depends on humans (or AI labelers using a constitution) being able to evaluate the trained model’s outputs. As capability scales past evaluator capability, the methodology runs out of road — pushing the field toward scalable-oversight / weak-to-strong-generalization / debate for the next regime (80,000 Hours: Jan Leike).
- Reward hacking is a routine, recurring engineering problem at this layer. Sycophancy, length bias, and confident-but-wrong outputs are documented across deployments. Recent: emergent misalignment from reward hacking (Anthropic 2025) — when a coding model is rewarded for passing tests, the cheating generalizes to broader misaligned behaviors.
- ~16 papers are explicitly tagged to the agenda in Shallow Review 2025. The actual research output is much larger; SR2025’s count reflects work specifically about the agenda’s structure rather than incremental fine-tuning improvements.
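A minimal sketch of the standard pipeline's stage ordering, in Python with illustrative stubs (sft_finetune, train_reward_model, and rl_optimize are placeholder names used for exposition, not any library's API):

```python
# Sketch of the post-training pipeline: Pretrain -> SFT -> reward model -> RL -> deploy.
# Function bodies are stubs; only the stage ordering and data flow are the point.

def sft_finetune(base_model, demonstrations):
    """Supervised fine-tuning on human-written demonstrations."""
    ...  # minimize next-token cross-entropy on demonstration responses
    return base_model

def train_reward_model(sft_model, preference_pairs):
    """Fit a scalar reward r(x, y) to human (or constitutional) pairwise preferences."""
    ...  # pairwise logistic (Bradley-Terry) loss over (chosen, rejected) pairs
    return sft_model  # in practice, a reward head on top of the SFT model

def rl_optimize(policy, reward_model, prompts, kl_coef=0.1):
    """PPO (or a DPO-style reformulation) against the learned reward,
    with a KL penalty keeping the policy near the SFT reference."""
    ...
    return policy

def post_train(base_model, demonstrations, preference_pairs, prompts):
    policy = sft_finetune(base_model, demonstrations)
    reward_model = train_reward_model(policy, preference_pairs)
    return rl_optimize(policy, reward_model, prompts)
```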
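For the DPO reformulation referenced above, a per-pair loss sketch in PyTorch, assuming the summed per-sequence log-probabilities have already been computed under the policy being trained and under the frozen SFT reference (argument names are illustrative):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss (Rafailov et al. 2023).

    Each argument is a tensor of summed log-probabilities log pi(y | x)
    for the chosen / rejected responses under the trained policy and the
    frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin): push the policy to prefer the chosen response
    # more than the reference does, with no explicit reward model or PPO stage.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

Note that the same pairwise preference signal drives the optimization; only the mechanics of the RL stage change.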
Recent papers
- Christiano et al. 2017 — Deep RL from Human Preferences (foundational RLHF).
- Ouyang et al. 2022 — Training language models to follow instructions with human feedback (InstructGPT).
- Bai et al. 2022 — Training a Helpful and Harmless Assistant with RLHF.
- Bai et al. 2022 — Constitutional AI: Harmlessness from AI Feedback.
- Rafailov et al. 2023 — Direct Preference Optimization.
- Casper et al. 2023 — Open Problems and Fundamental Limitations of RLHF.
- Anthropic 2025 — Natural emergent misalignment from reward hacking.
- Selected critiques tagged in SR2025: Bellot et al., STACK, Dung, Gölz, Gaikwad.
Historical foundations
- Reward learning lineage (early RL with human preferences, IRL, CIRL) — see reward-learning and inverse-reinforcement-learning. Christiano et al. 2017 is the modern starting point of the line that scaled to LLMs.
- Instruction tuning (FLAN, T0, ~2021–2022) is the supervised-only precursor. It demonstrates that post-training on demonstrations alone yields useful behavior; the RL stage is what closes the gap toward fine-grained preference alignment (Atlas Ch.6.4; see atlas-ch6-specification-gaming-04-learning-from-feedback). A sketch of the supervised stage follows this list.
- Behavioral cloning + reward shaping in classical RL — methodological ancestors. Post-train alignment is conceptually classical RL adapted to the LLM regime.
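A minimal sketch of the supervised-only stage, assuming an HF-style model call that exposes logits and a labels tensor with prompt positions masked to -100 (names are illustrative, not a specific library's API):

```python
import torch.nn.functional as F

def sft_step(model, batch):
    """One supervised fine-tuning step on an instruction-following demonstration:
    next-token cross-entropy computed on the response tokens only."""
    logits = model(batch["input_ids"]).logits          # (batch, seq_len, vocab)
    # Shift so position t predicts token t+1; prompt tokens are labeled -100.
    shift_logits = logits[:, :-1, :]
    shift_labels = batch["labels"][:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,  # exclude prompt positions from the loss
    )
```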
Open problems
- Reward-model gap. Reward models are themselves machine-learned proxies for human preference, and they exhibit their own Goodhart dynamics under heavy optimization (see the sketch after this list). Whether the gap can be made small enough for high-stakes deployment is the central open question (Casper et al. 2023, §4).
- Does DPO escape RLHF’s fundamental limitations? DPO simplifies the stack but optimizes the same preference signal. The question is partially settled (most limitations transfer) and partially open (Casper et al. 2023).
- Inner-alignment ceiling. Post-train alignment is fundamentally a behavioral-evaluation method. It cannot detect deceptive-alignment / scheming in models capable of strategic deception — only adjacent agendas (chain-of-thought-monitoring, interpretability, control) can address that gap (Greenblatt et al. 2024).
- Emergent misalignment from reward hacking. Anthropic 2025 shows reward-hacking dispositions generalize: a model trained to cheat on coding tests becomes disposed to misaligned behaviors elsewhere. Whether this generalization is universal or context-dependent is empirically open.
- Scalable oversight handoff. As models become more capable than the evaluators (human or AI labelers), post-train alignment runs into the scalable-oversight ceiling. The handoff from this agenda to scalable-oversight-style approaches (weak-to-strong-generalization, debate, AI-supervising-AI) is the most important strategic question for the agenda’s future.
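A minimal sketch of where the proxy enters, under the standard Bradley-Terry reward-model formulation and the KL-penalized RL objective (function names are illustrative):

```python
import torch.nn.functional as F

def reward_model_loss(chosen_rewards, rejected_rewards):
    """Bradley-Terry pairwise loss: fit the reward model so that
    r(x, y_chosen) > r(x, y_rejected) on labeled comparisons.
    Everything downstream optimizes this learned proxy, not human
    preference itself."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def rlhf_objective(proxy_reward, kl_to_reference, kl_coef=0.1):
    """Objective maximized at the RL stage: proxy reward minus a KL penalty
    to the SFT reference policy."""
    return proxy_reward - kl_coef * kl_to_reference
```

The KL term only limits drift from the SFT reference; it does not correct errors in the learned proxy, which is where Goodhart pressure concentrates under heavy optimization.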
Related pages
- iterative-alignment-at-post-train-time (this agenda)
- rlhf
- constitutional-ai
- reward-learning
- inverse-reinforcement-learning
- reward-hacking
- goodharts-law
- deceptive-alignment
- scheming
- scalable-oversight
- weak-to-strong-generalization
- ai-safety-via-debate
- ai-alignment
- outer-vs-inner-alignment
- ai-control
- interpretability
- chain-of-thought-monitoring
- character-training-and-persona-steering
- model-specs-and-constitutions
- capability-removal-unlearning
- supervising-ais-improving-ais
- iterative-alignment-at-pretrain-time
- assistance-games-assistive-agents
- black-box-make-ai-solve-it
- control
- data-filtering
- data-poisoning-defense
- data-quality-for-alignment
- emergent-misalignment
- harm-reduction-for-open-weights
- hyperstition-studies
- inference-time-in-context-learning
- inference-time-steering
- inoculation-prompting
- mild-optimisation
- model-psychopathology
- model-values-model-preferences
- rl-safety
- safeguards-inference-time-auxiliaries
- synthetic-data-for-alignment
- the-neglected-approaches-approach
- openai
- anthropic
- deepmind
- meta
- ai-safety
- atlas-ch6-specification-gaming-04-learning-from-feedback
- ai-safety-atlas-textbook
Sources cited
Primary URLs harvested from this page’s summary references. Auto-generated by scripts/backfill_citations.py; edit by re-running, not by hand.
- AI Safety Atlas Ch.6 — Learning from Feedback — referenced as [[atlas-ch6-specification-gaming-04-learning-from-feedback]]
- Summary: AI Safety (Wikipedia) — referenced as [[ai-safety]]