Trading Inference-Time Compute for Adversarial Robustness

OpenAI — 2025-01-22 — arXiv

Summary

An empirical study showing that reasoning models such as o1-preview and o1-mini become more robust to a range of adversarial attacks (many-shot jailbreaking, prompt injection, soft-token attacks, LMP attacks) as they spend more inference-time compute, with attack success probability often decaying to near zero.

Key Result

Across multiple attack types, adversary success probability decreases as inference-time compute increases. Exceptions exist: an attacker can sometimes trick the model into not using its inference-time compute productively, in which case the trend breaks down.
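As a minimal sketch of the kind of measurement behind this result, attack success probability at each compute budget can be estimated as the fraction of successful attack attempts. The budgets and outcomes below are hypothetical illustrative data, not numbers from the paper:

```python
def attack_success_rates(trials):
    """Map each inference-time compute budget (e.g. reasoning tokens)
    to the empirical attack success probability at that budget."""
    return {budget: sum(outcomes) / len(outcomes)
            for budget, outcomes in trials.items()}

# Hypothetical per-attempt attack outcomes (True = attack succeeded),
# keyed by a made-up reasoning-token budget; not data from the paper.
trials = {
    128:  [True, True, True, False, True],
    512:  [True, False, False, True, False],
    2048: [False, False, True, False, False],
    8192: [False, False, False, False, False],
}

rates = attack_success_rates(trials)
for budget in sorted(rates):
    print(budget, rates[budget])
```

In this illustrative data the estimated success probability falls monotonically with compute, mirroring the qualitative trend the paper reports.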

Source