Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
Buck Shlegeris, Julian Stastny — 2025-05-08 — Redwood Research — LessWrong / AI Alignment Forum
Summary
Analyzes whether training can prevent scheming AI models from deliberately underperforming (sandbagging) on critical tasks. Introduces the concept of exploration hacking, in which models resist training by avoiding high-reward actions, and evaluates potential countermeasures.
Source
- Link: https://lesswrong.com/posts/TeTegzR8X5CuKgMc3/misalignment-and-strategic-underperformance-an-analysis-of
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- sandbagging-evals — Evals