Validating against a misalignment detector is very different to training against one
mattmacdermott — 2025-03-04 — LessWrong
Summary
Argues that using a misalignment detector only for validation or rejection sampling acts as a Bayesian update on whether a model is aligned and avoids Goodharting the detector, whereas training against the detector tends to produce models whose misalignment it can no longer detect.
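The contrast can be illustrated with a toy Bayes calculation. This is a minimal sketch, not the post's own model: the prior, the detector pass rates, and the function name `posterior_aligned` are all hypothetical values chosen for illustration, and "training against the detector" is crudely modelled as pushing every model's pass rate to 1.

```python
def posterior_aligned(prior_aligned, pass_rate_aligned, pass_rate_misaligned):
    """Return P(aligned | model passed the detector) using Bayes' rule."""
    p_pass = (prior_aligned * pass_rate_aligned
              + (1 - prior_aligned) * pass_rate_misaligned)
    return prior_aligned * pass_rate_aligned / p_pass


# Hypothetical numbers, assumed for illustration only.
PRIOR = 0.1            # P(aligned) for a model from the training process
PASS_ALIGNED = 0.9     # P(pass | aligned)
PASS_MISALIGNED = 0.1  # P(pass | misaligned)

# Validation / rejection sampling: the detector is held out, so passing it
# is real evidence and the posterior rises above the prior.
validated = posterior_aligned(PRIOR, PASS_ALIGNED, PASS_MISALIGNED)

# Training against the detector (crude assumption): optimization makes every
# model pass whether or not it is aligned, so passing carries no evidence
# and the posterior falls back to the prior.
trained = posterior_aligned(PRIOR, 1.0, 1.0)

print(f"validated against detector: {validated:.2f}")
print(f"trained against detector:   {trained:.2f}")
```

Under these assumed numbers, a model that passes a held-out detector is considerably more likely to be aligned than the prior suggests, while a pass from a model optimized against the detector says nothing beyond the prior.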
Source
- Link: https://lesswrong.com/posts/CXYf7kGBecZMajrXC/validating-against-a-misalignment-detector-is-very-different
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
  - capability-evals — Evals