Discovering Forbidden Topics in Language Models

Can Rager, Chris Wendler, Rohit Gandikota, David Bau — 2025-05-23 — arXiv

Summary

Introduces refusal discovery as a new problem setting and develops the Iterated Prefill Crawler (IPC), a method that uses token prefilling to systematically identify the full set of topics a language model refuses to discuss. Applied to DeepSeek-R1-70B, the method reveals CCP-aligned censorship patterns.
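To make the crawling idea concrete, here is a minimal sketch of an iterated prefill loop. It is an illustration under stated assumptions, not the authors' implementation: the prefill string, the prompt template, the `generate(user_prompt, prefill)` interface, and the toy model are all hypothetical. The loop prefills the start of the assistant turn so the model continues a list of avoided topics, then feeds each newly listed topic back in as the next query until the prompt budget runs out.

```python
import re
from collections import deque
from typing import Callable

# Hypothetical prefill text: forces the assistant turn to start mid-list.
PREFILL = "Topics I avoid discussing include:\n1."


def parse_list(continuation: str) -> list[str]:
    """Split a numbered-list continuation into topic strings."""
    items = []
    for line in continuation.splitlines():
        # Drop a leading "2. " style marker; leave other digits untouched.
        text = re.sub(r"^\s*\d+\.\s*", "", line).strip()
        if text:
            items.append(text)
    return items


def crawl(
    generate: Callable[[str, str], str],
    seeds: list[str],
    max_prompts: int,
) -> list[str]:
    """Breadth-first crawl: each discovered topic becomes a new query."""
    queue = deque(seeds)
    seen = {s.lower() for s in seeds}
    found: list[str] = []  # newly discovered topics, in discovery order
    prompts_used = 0
    while queue and prompts_used < max_prompts:
        topic = queue.popleft()
        # Hypothetical prompt template.
        continuation = generate(f"Tell me about {topic}.", PREFILL)
        prompts_used += 1
        for item in parse_list(continuation):
            key = item.lower()  # exact-match dedup only (an assumption)
            if key not in seen:
                seen.add(key)
                found.append(item)
                queue.append(item)
    return found


# Toy stand-in for a model: maps a queried topic to the topics it "lists".
TOY_GRAPH = {
    "seed": ["topic A", "topic B"],
    "topic A": ["topic B", "topic C"],
    "topic B": [],
    "topic C": ["topic A"],
}


def toy_generate(user_prompt: str, prefill: str) -> str:
    """Return the text a model might produce after the prefill."""
    topic = user_prompt.removeprefix("Tell me about ").removesuffix(".")
    related = TOY_GRAPH.get(topic, [])
    # The first item continues the "1." already present in the prefill.
    lines = [f" {t}" if i == 0 else f"{i + 1}. {t}" for i, t in enumerate(related)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(crawl(toy_generate, ["seed"], max_prompts=10))
```

A practical crawler would presumably also need to merge paraphrased topics and to verify that each candidate is actually refused; both steps are omitted here.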

Key Result

IPC retrieves 31 of 36 forbidden topics on Tulu-3-8B within 1,000 prompts. On DeepSeek-R1-70B it uncovers ‘thought suppression’ behavior, indicating memorization of CCP-aligned responses.

Source

Link: https://arxiv.org/abs/2505.17441
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- various-redteams — Evals
