Putting up Bumpers

2025 — Anthropic — Anthropic Alignment Science Blog

Summary

Proposes ‘bumpers’, an iterative alignment strategy for early AGI systems: build multiple independent detection methods to catch misalignment, then rewind and adjust training when warning signs appear, rather than relying on a deep theoretical understanding of alignment.

Source