Forecasting Rare Language Model Behaviors

Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, … (+4 more) — 2025-02-24 — Anthropic, OpenAI, UC Berkeley — arXiv

Summary

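Introduces a method for forecasting rare, dangerous language model behaviors (for example, assisting with dangerous chemical synthesis or taking power-seeking actions) at deployment scale. The method analyzes elicitation probabilities, that is, the probability that a given query produces the target behavior, measured in small-scale evaluations, and predicts the emergence of these behaviors across up to three orders of magnitude more queries than were tested.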

Key Result

The forecasts predicted the emergence of diverse undesirable behaviors across up to three orders of magnitude of query volume. They rely on how the largest observed elicitation probabilities scale with the number of queries.
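
As a rough illustration of the scaling idea, the sketch below assumes each query has an independent elicitation probability and extrapolates the largest observed probability to a larger query volume with a simple log-log linear fit. The function names (`prob_any_elicits`, `forecast_log_max_prob`) and the linear fit are hypothetical stand-ins chosen for illustration; this is not the paper's estimator.

```python
import math
import random


def prob_any_elicits(probs):
    """Chance that at least one query elicits the behavior.

    Assumes queries are independent: 1 - prod(1 - p_i).
    """
    if any(p >= 1.0 for p in probs):
        return 1.0
    log_none = sum(math.log1p(-p) for p in probs)
    return 1.0 - math.exp(log_none)


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_log_max_prob(probs, sizes, target_n, trials=200, seed=0):
    """Extrapolate the log of the largest elicitation probability to target_n queries.

    Illustrative assumption: log(max p) is roughly linear in log(n).
    Requires every probability in `probs` to be strictly positive.
    Returns (slope, forecast), with the forecast capped at log(1) = 0.
    """
    rng = random.Random(seed)
    log_sizes, mean_log_max = [], []
    for n in sizes:
        # Average the log of the largest probability over resampled query sets.
        total = sum(math.log(max(rng.choices(probs, k=n))) for _ in range(trials))
        log_sizes.append(math.log(n))
        mean_log_max.append(total / trials)
    slope, intercept = fit_line(log_sizes, mean_log_max)
    forecast = min(0.0, intercept + slope * math.log(target_n))
    return slope, forecast
```

Given a pool of measured elicitation probabilities, `forecast_log_max_prob(pool, [10, 30, 100, 300], 100_000)` returns the fitted slope and a forecast of the log of the largest probability at 100,000 queries. Under these assumptions, `prob_any_elicits` then converts a set of per-query probabilities into the chance that the behavior appears at least once.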

Source