AI Safety Atlas Ch.1 — Current Capabilities

Source: Current Capabilities | ai-safety-atlas.com/chapters/v1/capabilities/current-capabilities/ | Authors: Markov Grey & Charbel-Raphaël Ségerie | 14 min

A late-2025 survey of what AI systems can actually do across seven capability classes. The trajectory matters more than the snapshot — and the trajectory is fast: language went from coherent paragraphs to research assistants in a few years; image generation went from laughable to professional-grade in a decade; video generation is following a similar path compressed into three years.

Games

Game-playing AI is superhuman at many games. Milestones: Deep Blue beat Kasparov (1997), Watson won Jeopardy! (2011), AlphaGo beat Lee Sedol (2016) with the famously creative Move 37. Extensions to video games: OpenAI Five at Dota 2 (2019), AlphaStar at StarCraft II (2019), MuZero playing Atari/Go/chess/shogi without being told the rules (2020).

By 2023, Voyager demonstrated LLM-powered Minecraft play — long-horizon planning across hundreds of steps. Meta’s Cicero (2022) showed strategic negotiation and deception in Diplomacy through natural language.

The lesson is transfer: planning, pattern recognition, and adversarial thinking developed in games now apply to scientific research, mathematical proofs, and complex real-world problems.

Text Generation

By 2025, GPT-5 scores 92.5% on MMLU, whose subjects span STEM, law, nutrition, and religion. Claude (Anthropic), Grok (xAI), GLM-4.7 (Zhipu AI), and DeepSeek-v3.2 perform comparably.

Language models have become a substrate for many higher-level capabilities — research, reasoning, code — all stemming from the same generate-and-refine loop.
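At the bottom of that loop is plain autoregressive sampling: predict a distribution over the next token, sample one, append it, repeat. A toy sketch under an invented bigram table (not any real model’s vocabulary or weights):

```python
import random

# Hypothetical bigram "model": next-token counts, invented for illustration.
BIGRAMS = {
    "<s>": {"the": 3, "a": 1},
    "the": {"model": 2, "answer": 1},
    "a": {"model": 1},
    "model": {"generates": 2, "</s>": 1},
    "answer": {"</s>": 1},
    "generates": {"text": 1},
    "text": {"</s>": 1},
}

def sample_next(token, rng):
    """Sample the next token proportionally to its (toy) counts."""
    tokens, weights = zip(*BIGRAMS[token].items())
    return rng.choices(tokens, weights=weights, k=1)[0]

def generate(rng, max_len=10):
    """Autoregressive loop: each new token conditions on what came before."""
    out, tok = [], "<s>"
    for _ in range(max_len):
        tok = sample_next(tok, rng)
        if tok == "</s>":  # end-of-sequence token stops generation
            break
        out.append(tok)
    return " ".join(out)

print(generate(random.Random(0)))
```

Real models condition on the whole context rather than just the previous token, but the generate loop itself is this simple.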

Tool Use

LLMs can use external tools intelligently, dramatically boosting performance. “In December 2025, at least 10,000 tool servers are operational, including meta-tools like ‘a tool to search for tools’.” OpenAI’s o3 with tools beats o3 alone by ~5% on Humanity’s Last Exam (HLE). Companies now report benchmarks separately with and without tools — the gap is large enough to matter.
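Mechanically, tool use is a loop: the model either emits a tool call or a final answer; tool results are appended to the context and the model is queried again. A minimal sketch in which `fake_model`, the `search` stub, and the `OBSERVATION:` convention are all invented for illustration (real protocols differ in the details):

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolCall:
    name: str
    arg: str

TOOLS: dict[str, Callable[[str], str]] = {
    # Stub "search" tool standing in for a real external service.
    "search": lambda q: f"top result for {q!r}",
}

def fake_model(prompt: str) -> ToolCall | str:
    """Stand-in for an LLM: requests a tool once, then answers."""
    if "OBSERVATION:" not in prompt:
        return ToolCall("search", "HLE benchmark")
    return "Final answer, grounded in the observation."

def run(prompt: str, max_steps: int = 5) -> str:
    for _ in range(max_steps):
        step = fake_model(prompt)
        if isinstance(step, str):             # model chose to answer
            return step
        result = TOOLS[step.name](step.arg)   # execute the requested tool
        prompt += f"\nOBSERVATION: {result}"  # feed the result back in
    return "step limit reached"

print(run("When was HLE released?"))
```

The “tool to search for tools” pattern is just this loop with a registry lookup as one of the tools.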

Reasoning and Research

Late 2024 saw OpenAI’s o1 — the first “reasoning” model. These trade thinking time (inference compute) for accuracy. By 2025, both OpenAI and Google DeepMind achieved gold-medal performance at the 2025 IMO. On FrontierMath (research-level mathematics), GPT-5.2 solved 41% of tier 1–3 problems.

The chapter introduces the term inference-time scaling for this — generating more tokens before answering. Like tool use, this delivered enough of a boost that companies now report results separately for high-compute and low-compute thinking.
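One concrete form of inference-time scaling is self-consistency: sample many independent reasoning chains and majority-vote their final answers, trading compute for accuracy. A toy sketch in which the 60%-accurate `noisy_solver` is invented, standing in for one sampled chain of thought:

```python
import random
from collections import Counter

def noisy_solver(rng):
    """Stand-in for one sampled reasoning chain: right 60% of the time."""
    return 42 if rng.random() < 0.6 else rng.randrange(100)

def answer(n_samples, seed=0):
    """Self-consistency: spend more inference compute (more samples),
    then majority-vote the final answers."""
    rng = random.Random(seed)
    votes = Counter(noisy_solver(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

# A single sample is unreliable; many samples converge on the mode.
print(answer(1), answer(51))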

Scientific Research as Capability

  • Google’s AI co-scientist (2025) generates and evaluates proposals for drug repurposing, drug targets, antimicrobial resistance.
  • Fully autonomous AI scientists generate ideas, write code, run experiments, write papers, simulate review.
  • AlphaFold (Hassabis & Jumper, Nobel in Chemistry) — protein folding. AlphaGenome for human DNA. AlphaEvolve for faster ML algorithms. AlphaChip for semiconductor design.

LLMs also show metacognition (predicting which questions they can answer correctly) and theory of mind (attributing mental states).

Software Development

Coding evolved from autocomplete to collaborative software development. “In 2025, systems like Claude Opus 4.5, Gemini 3 Pro, and GPT-5.2 implement features and entire applications increasingly independently.”

Concrete benchmark — SWE-Bench (real GitHub issues): 15% (Claude 3 Opus, 2024) → 74% (Tools + Claude 4 Opus, 2025).

Vision: Images and Video

GANs (2014) → photorealistic scenes from prompts (2023) → Midjourney v7 indistinguishable from professional photography (2025). Video generation following the same trajectory; deepfakes increasingly indistinguishable.

LMMs (Large Multimodal Models) like Gemini 3 Pro Image work text-to-image, image-to-image, and complex editing — changing lighting, style, composition while maintaining coherence.

Robotics

LLMs + visual models = robot control models (RT-1, RT-2). Robots use language-model techniques (decompose actions into step-by-step plans) to control manipulators — opening cabinets, operating elevators, sautéing shrimp, rinsing pans.

Industrial scale matters: China installed 276,300 industrial robots in 2023. Amazon operates over 1 million robots across fulfillment.

Why This Subchapter Matters for Safety

Three implications, mostly implicit but flagged throughout:

  1. The benchmark-destruction pattern continues — confirms what scaling-laws documents about benchmark obsolescence.
  2. Tool use + reasoning scaffolds compound capability beyond raw model scale — supporting the “Scale + Techniques + Tools” hypothesis from the previous subchapter.
  3. Embodiment + scientific research capability raise specific ai-military-applications / biosecurity concerns absent from a pure language-model frame.

Connection to Wiki

This is the chapter’s “show, don’t tell” of capabilities — the same domain reviewed in situational-awareness (2024) and ai-2027 (forward projection), but with late-2025 numbers grounded in named systems. Useful as a reality-check anchor for any “how capable is AI today?” query.