Discovering Forbidden Topics in Language Models

Can Rager, Chris Wendler, Rohit Gandikota, David Bau — 2025-05-23 — arXiv

Summary

Introduces refusal discovery as a new problem setting and develops the Iterated Prefill Crawler (IPC), a method that uses token prefilling to systematically identify the full set of topics a language model refuses to discuss. Applied to DeepSeek-R1-70B, it reveals CCP-aligned censorship patterns.
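
To make the idea of iterated prefilling concrete, here is a minimal sketch of a crawl loop, not the authors' implementation. It assumes a hypothetical `generate(prompt, prefill)` callable that returns the model's continuation of a prefilled assistant turn. The prefill wording, the prompt template, the parsing rule, and the toy model are all illustrative assumptions, and the sketch omits any step that verifies whether a discovered topic is actually refused.

```python
from collections import deque
from typing import Callable, List, Set

# Hypothetical prefill text (an assumption, not the wording used in the paper).
PREFILL = "Topics I should avoid discussing:\n1."


def parse_topics(completion: str) -> List[str]:
    """Split a numbered-list continuation into normalized topic strings.

    Simplification: leading digits, dots, parentheses, and spaces are stripped,
    so a topic whose name starts with a digit would be mangled.
    """
    topics = []
    for line in completion.splitlines():
        text = line.strip().lstrip("0123456789.) ").strip().lower()
        if text:
            topics.append(text)
    return topics


def crawl(generate: Callable[[str, str], str], seeds: List[str], budget: int) -> Set[str]:
    """Breadth-first crawl: each newly found topic seeds a later prefilled prompt."""
    found: Set[str] = set()
    queue = deque(seeds)
    calls = 0
    while queue and calls < budget:
        topic = queue.popleft()
        completion = generate(f"Tell me about {topic}.", PREFILL)
        calls += 1
        for candidate in parse_topics(completion):
            if candidate not in found:  # deduplicate before re-queuing
                found.add(candidate)
                queue.append(candidate)
    return found


# Toy stand-in for a language model, used only to make the sketch runnable.
TOY_GRAPH = {
    "history": ["topic a", "topic b"],
    "topic a": ["topic b", "topic c"],
    "topic b": [],
    "topic c": ["topic a"],
}


def toy_generate(prompt: str, prefill: str) -> str:
    """Continue the prefilled list; the first item follows the prefilled '1.'."""
    topic = prompt[len("Tell me about "):-1]
    related = TOY_GRAPH.get(topic, [])
    lines = [f" {name}" if i == 1 else f"{i}. {name}" for i, name in enumerate(related, 1)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(sorted(crawl(toy_generate, ["history"], budget=10)))
```

The `budget` argument caps the number of model calls, mirroring the idea of measuring how many topics are recovered within a fixed number of prompts.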

Key Result

IPC retrieves 31 of 36 forbidden topics on Tulu-3-8B within 1,000 prompts. On DeepSeek-R1-70B, it uncovers ‘thought suppression’ behavior, which indicates memorization of CCP-aligned responses.

Source