Training AI to do alignment research we don’t already know how to do

joshc — 2025-02-24 — Redwood Research — LessWrong

Summary

Proposes a ‘training for truth-seeking’ approach in which AI agents are trained to improve their beliefs and report findings the way human experts do, arguing that this addresses ‘garbage-in, garbage-out’ concerns about automating alignment research without requiring humans to already hold correct beliefs.

Source