Trading Inference-Time Compute for Adversarial Robustness
OpenAI — 2025-01-22 — arXiv
Summary
An empirical investigation showing that reasoning models such as o1-preview and o1-mini become more robust to a range of adversarial attacks (many-shot jailbreaking, prompt injection, soft-token attacks, and LMP attacks) as they spend more inference-time compute, with attack success probability often decaying to near zero.
Key Result
Adversary success probability decreases as inference-time compute increases across multiple attack types, though exceptions exist in which an attacker can trick the model into spending its compute unproductively.
Source
- Link: https://openai.com/index/trading-inference-time-compute-for-adversarial-robustness
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals