Trading Inference-Time Compute for Adversarial Robustness
OpenAI — 2025-01-22 — arXiv
Summary
An empirical investigation showing that reasoning models such as o1-preview and o1-mini become more robust to a range of adversarial attacks (many-shot jailbreaking, prompt injection, soft-token attacks, and LMP attacks) as they spend more inference-time compute, with attack success probability often decaying to near zero.
Key Result
Adversary success probability decreases as inference-time compute increases across multiple attack types, though exceptions exist in which an attacker can trick the model into spending its compute unproductively.
Source
- Link: https://openai.com/index/trading-inference-time-compute-for-adversarial-robustness
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- various-redteams — Evals