AI Safety Compendium

Training on Documents About Reward Hacking Induces Reward Hacking

27 Apr 2026 · 1 min read

Evan Hubinger, Nathan Hu — 2025-01-21 — Anthropic

Source

  • Link: https://www.lesswrong.com/posts/qXYLvjGL9QvD3aFSW/training-on-documents-about-reward-hacking-induces-reward
  • Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
    • hyperstition-studies — Black-box safety (understand and control current model behaviour) / Better data

Related Pages

  • hyperstition-studies
