Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking

Buck Shlegeris, Julian Stastny — 2025-05-08 — Redwood Research — LessWrong / AI Alignment Forum

Summary

Analyzes whether training can prevent scheming AI models from deliberately underperforming (sandbagging) on critical tasks. Introduces the concept of exploration hacking, in which models resist training by avoiding high-reward actions so that those actions are never reinforced, and evaluates potential countermeasures.

Source