LLMs Outperform Experts on Challenging Biology Benchmarks

Lennart Justen — 2025-05-09 — arXiv

Summary

Systematically evaluates 27 frontier LLMs on eight biology benchmarks spanning molecular biology, genetics, virology, and biosecurity, tracking dangerous capability growth from November 2022 to April 2025.

Key Result

On the Virology Capabilities Test, OpenAI’s o3 now performs twice as well as expert virologists, and top model performance increased more than 4-fold over the study period.

Source

Link: https://arxiv.org/abs/2505.06108
Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- wmd-evals-weapons-of-mass-destruction — Evals
