Forecasting Rare Language Model Behaviors

Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, … (+4 more) — 2025-02-24 — Anthropic, OpenAI, UC Berkeley — arXiv

Summary

Introduces a method for forecasting rare, dangerous language model behaviors (e.g., chemical synthesis instructions, power-seeking) at deployment scale. The method analyzes elicitation probabilities measured in small-scale evaluations and predicts the emergence of these behaviors across up to three orders of magnitude more queries than were tested.

Key Result

Successfully predicted the emergence of diverse undesirable behaviors across up to three orders of magnitude of query volume by studying how the largest observed elicitation probabilities scale with the number of queries.
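
The idea of extrapolating the largest observed elicitation probability to a larger query volume can be illustrated with a minimal sketch. This is not the paper's procedure: the linear relationship between log(-log(max p)) and log(n), the helper names (`expected_max_by_subsample`, `fit_line`, `forecast_max_probability`), and the synthetic probabilities are all assumptions made purely for illustration.

```python
import math
import random


def expected_max_by_subsample(probs, n, trials=200, seed=0):
    """Monte Carlo estimate of the mean largest probability among n sampled queries."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.sample(probs, n))
    return total / trials


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_max_probability(probs, eval_sizes, deploy_n):
    """Extrapolate the largest elicitation probability to deploy_n queries.

    Assumption (illustrative only): log(-log(max p)) is roughly linear in log(n).
    Requires every probability to lie strictly between 0 and 1.
    """
    xs = [math.log(n) for n in eval_sizes]
    ys = [
        math.log(-math.log(expected_max_by_subsample(probs, n)))
        for n in eval_sizes
    ]
    slope, intercept = fit_line(xs, ys)
    y_deploy = slope * math.log(deploy_n) + intercept
    return math.exp(-math.exp(y_deploy))


# Synthetic per-query elicitation probabilities (hypothetical data, not results).
data_rng = random.Random(1)
synthetic_probs = [10 ** data_rng.uniform(-8, -2) for _ in range(1000)]

eval_sizes = [10, 30, 100, 300]
observed_at_300 = expected_max_by_subsample(synthetic_probs, 300)
forecast = forecast_max_probability(synthetic_probs, eval_sizes, deploy_n=100_000)
print(f"mean max at n=300: {observed_at_300:.4g}")
print(f"extrapolated max at n=100000: {forecast:.4g}")
```

The forecast depends entirely on the assumed functional form, so a sketch like this shows only the shape of the approach (measure at small n, fit, extrapolate) rather than a calibrated risk estimate.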

Source

Link: https://arxiv.org/abs/2502.16797
Listed in the Shallow Review of Technical AI Safety 2025 under two agendas:
- anthropic — Labs (giant companies)
- capability-evals — Evals
