Discovering Forbidden Topics in Language Models
Can Rager, Chris Wendler, Rohit Gandikota, David Bau — 2025-05-23 — arXiv
Summary
Introduces refusal discovery as a new problem setting and develops Iterated Prefill Crawler (IPC) method to systematically identify the full set of topics language models refuse to discuss using token prefilling, revealing CCP-aligned censorship patterns in DeepSeek-R1-70B.
Key Result
IPC retrieves 31 out of 36 forbidden topics on Tulu-3-8B within 1000 prompts and discovers ‘thought suppression’ behavior in DeepSeek-R1-70B indicating memorization of CCP-aligned responses.
Source
- Link: https://arxiv.org/abs/2505.17441
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda(s):
- various-redteams — Evals