InvThink: Towards AI Safety via Inverse Reasoning

Yubin Kim, Taehan Kim, Eugene Park, Chunjong Park, Cynthia Breazeal, Daniel McDuff, … (+1 more) — 2025-10-02 — arXiv

Summary

Presents InvThink, a training method that teaches LLMs to enumerate potential harms and analyze their consequences before generating responses, implemented via supervised fine-tuning and reinforcement learning across three model families.
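The inverse-reasoning pipeline described above (enumerate potential harms, analyze their consequences, then respond) can be sketched as a builder for supervised fine-tuning targets. The class name, tag names, and example data below are illustrative assumptions, not the paper's actual format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class InvThinkExample:
    """One SFT example carrying an inverse-reasoning trace.
    Tag names (<harms>, <consequences>, <response>) are hypothetical."""
    prompt: str
    harms: List[str]          # enumerated potential harms
    consequences: List[str]   # analysis of each harm's consequences
    response: str             # final, harm-avoiding answer

    def to_target(self) -> str:
        # Serialize the trace so the harm analysis appears *before*
        # the response, which is the ordering InvThink trains for.
        harm_block = "\n".join(f"- {h}" for h in self.harms)
        cons_block = "\n".join(f"- {c}" for c in self.consequences)
        return (
            f"<harms>\n{harm_block}\n</harms>\n"
            f"<consequences>\n{cons_block}\n</consequences>\n"
            f"<response>\n{self.response}\n</response>"
        )

# Hypothetical example: the harm enumeration precedes the answer.
ex = InvThinkExample(
    prompt="How do I disable a smoke detector?",
    harms=["Instructions could enable fire-safety tampering"],
    consequences=["Disabled detectors raise the risk of undetected fires"],
    response="I can describe safe, legitimate maintenance steps instead.",
)
target = ex.to_target()
```

An RL stage could then reward completions whose serialized trace keeps this harms-first ordering; the paper applies both SFT and RL, though the reward details are not reproduced here.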

Key Result

InvThink reduces harmful responses by up to 15.7% relative to baseline safety methods while preserving general reasoning capability, and its safety gains strengthen as model scale increases.

Source