Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking

Buck Shlegeris, Julian Stastny — 2025-05-08 — Redwood Research — LessWrong / AI Alignment Forum

Summary

Analyzes whether training can prevent scheming AI models from deliberately underperforming (sandbagging) on critical tasks. Introduces the concept of exploration hacking, in which models resist training by avoiding high-reward actions so that those actions are never reinforced, and evaluates potential countermeasures.

Source