Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models
Cameron Tice, Philipp Alexander Kreer, Nathan Helm-Burger, Prithviraj Singh Shahani, Fedor Ryzhenkov, Jacob Haimes, … (+2 more) — 2024-12-02
Source
- Link: https://arxiv.org/pdf/2412.01784
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- sandbagging-evals — Evals