Validating against a misalignment detector is very different to training against one

mattmacdermott — 2025-03-04 — LessWrong

Summary

Argues that using a misalignment detector only for validation or rejection sampling keeps a passing result meaningful as Bayesian evidence of alignment and avoids Goodharting, whereas training against the detector tends to produce models whose misalignment the detector can no longer see.

Source