Training AI to do alignment research we don’t already know how to do
joshc — 2025-02-24 — Redwood Research — LessWrong
Summary
Proposes a ‘training for truth-seeking’ approach in which AI agents are trained to improve their beliefs and report their findings the way human experts do, arguing that this addresses ‘garbage-in, garbage-out’ concerns about automating alignment research without requiring humans to already hold correct beliefs.
Source
- Link: https://lesswrong.com/posts/5gmALpCetyjkSPEDr/training-ai-to-do-alignment-research-we-don-t-already-know
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- black-box-make-ai-solve-it — Black-box safety (understand and control current model behaviour) / Iterative alignment