InvThink: Towards AI Safety via Inverse Reasoning
Yubin Kim, Taehan Kim, Eugene Park, Chunjong Park, Cynthia Breazeal, Daniel McDuff, … (+1 more) — 2025-10-02 — arXiv
Summary
Presents InvThink, a training method that teaches LLMs to enumerate potential harms and analyze their consequences before generating responses, implemented via supervised fine-tuning and reinforcement learning across three model families.
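The inverse-reasoning step described above can be sketched as a prompt template. This is a hypothetical illustration only; the template wording, function name, and structure are assumptions, not the paper's actual SFT/RL implementation:

```python
# Hypothetical sketch of an InvThink-style prompt: the model is instructed to
# enumerate potential harms and analyze their consequences before answering.
# All names and wording here are illustrative assumptions, not from the paper.

INVTHINK_TEMPLATE = (
    "Before answering, perform inverse reasoning:\n"
    "1. Enumerate potential harms your response could cause.\n"
    "2. Analyze the consequences of each harm.\n"
    "3. Write a response that avoids those failure modes.\n\n"
    "User query: {query}"
)


def build_invthink_prompt(query: str) -> str:
    """Wrap a user query with inverse-reasoning instructions."""
    return INVTHINK_TEMPLATE.format(query=query)


if __name__ == "__main__":
    print(build_invthink_prompt("How should I dispose of old batteries?"))
```

In the paper this behavior is trained into the model via supervised fine-tuning and reinforcement learning rather than applied purely at inference time; the template above only conveys the shape of the reasoning the model is taught to perform.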
Key Result
InvThink achieves up to a 15.7% reduction in harmful responses compared to baseline safety methods, while preserving general reasoning capabilities and showing stronger safety improvements with model scale.
Source
- Link: https://arxiv.org/abs/2510.01569
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- inference-time-in-context-learning — Black-box safety (understand and control current model behaviour) / Iterative alignment