Training on Documents About Reward Hacking Induces Reward Hacking
Evan Hubinger, Nathan Hu — 2025-01-21 — Anthropic
Source
- Link: https://www.lesswrong.com/posts/qXYLvjGL9QvD3aFSW/training-on-documents-about-reward-hacking-induces-reward
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- hyperstition-studies — Black-box safety (understand and control current model behaviour) / Better data