InvThink: Towards AI Safety via Inverse Reasoning
Yubin Kim, Taehan Kim, Eugene Park, Chunjong Park, Cynthia Breazeal, Daniel McDuff, … (+1 more) — 2025-10-02 — arXiv
Summary
Presents InvThink, a training method that teaches LLMs to enumerate potential harms and analyze their consequences before generating responses, implemented via supervised fine-tuning and reinforcement learning across three model families.
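The inverse-reasoning step described above can be sketched as a prompt template. This is a hypothetical illustration only; the template wording, function name, and structure are assumptions, not the paper's actual SFT/RL implementation:

```python
# Hypothetical sketch of an InvThink-style prompt: the model is instructed to
# enumerate potential harms and analyze their consequences before answering.
# All names and wording here are illustrative assumptions, not from the paper.

INVTHINK_TEMPLATE = (
    "Before answering, perform inverse reasoning:\n"
    "1. Enumerate potential harms your response could cause.\n"
    "2. Analyze the consequences of each harm.\n"
    "3. Write a response that avoids those failure modes.\n\n"
    "User query: {query}"
)


def build_invthink_prompt(query: str) -> str:
    """Wrap a user query with inverse-reasoning instructions."""
    return INVTHINK_TEMPLATE.format(query=query)


if __name__ == "__main__":
    print(build_invthink_prompt("How should I dispose of old batteries?"))
```

In the paper this behavior is trained into the model via supervised fine-tuning and reinforcement learning rather than applied purely at inference time; the template above only conveys the shape of the reasoning the model is taught to perform.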
Key Result
InvThink achieves up to a 15.7% reduction in harmful responses compared to baseline safety methods, while preserving general reasoning capabilities and showing stronger safety improvements with model scale.
Source
- Link: https://arxiv.org/abs/2510.01569
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- inference-time-in-context-learning — Black-box safety (understand and control current model behaviour) / Iterative alignment