Training AI to do alignment research we don’t already know how to do
joshc — 2025-02-24 — Redwood Research — LessWrong
Summary
Proposes a ‘training for truth-seeking’ approach in which AI agents are trained to improve their beliefs and report their findings the way human experts do, arguing that this addresses ‘garbage-in, garbage-out’ concerns about automating alignment research without requiring humans to already hold correct beliefs.
Source
- Link: https://lesswrong.com/posts/5gmALpCetyjkSPEDr/training-ai-to-do-alignment-research-we-don-t-already-know
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- black-box-make-ai-solve-it — Black-box safety (understand and control current model behaviour) / Iterative alignment