Forecasting Rare Language Model Behaviors
Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, … (+4 more) — 2025-02-24 — Anthropic, OpenAI, UC Berkeley — arXiv
Summary
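Introduces a method for forecasting rare, dangerous language model behaviors (such as assisting with dangerous chemical synthesis or taking power-seeking actions) at deployment scale. The method analyzes elicitation probabilities, i.e. the probability that a given query produces a target behavior, measured in small-scale evaluations, and forecasts the emergence of such behaviors across up to three orders of magnitude more queries than were tested.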
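To see why a behavior that never appears in a small evaluation can still emerge at deployment scale, a short arithmetic sketch may help. It assumes, purely for illustration, that every query independently elicits the behavior with the same probability `p`; the paper itself studies per-query probabilities that differ across queries. The function name `prob_at_least_one` and the numbers used are hypothetical.

```python
import math


def prob_at_least_one(p: float, n: int) -> float:
    """Probability that at least one of n independent queries elicits the behavior.

    Assumes every query has the same elicitation probability p (an
    illustrative simplification, not the paper's setting).
    """
    # 1 - (1 - p)^n, computed with log1p/expm1 so tiny p does not lose precision.
    return -math.expm1(n * math.log1p(-p))


if __name__ == "__main__":
    p = 1e-6  # hypothetical per-query elicitation probability
    for n in (10**3, 10**6, 10**9):
        print(f"n = {n:>13,d}  P(at least one) = {prob_at_least_one(p, n):.6f}")
```

Under these assumptions, a thousand-query evaluation would almost never surface the behavior, while a million-query deployment would surface it more often than not.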
Key Result
Forecasts predicted the emergence of diverse undesirable behaviors across up to three orders of magnitude of query volume, based on how the largest observed elicitation probabilities scale with the number of queries.
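The idea of extrapolating the largest elicitation probability to larger query volumes can be illustrated with a minimal sketch. This is not the authors' implementation. The rank-based mapping (the k-th largest of m sampled probabilities stands in for the typical maximum over m/k queries), the score transform -log(-log p), the straight-line fit, and the synthetic data are all assumptions made for illustration, and `fit_tail` and `forecast_max_prob` are hypothetical names.

```python
import math
import random


def fit_tail(probs, top_k=10):
    """Least-squares fit of score = intercept + slope * log(n) on the top_k probabilities.

    Assumption: the k-th largest of m samples approximates the typical
    maximum over n = m / k queries. Requires 2 <= top_k <= len(probs)
    and every probability strictly between 0 and 1.
    """
    m = len(probs)
    ranked = sorted(probs, reverse=True)[:top_k]
    xs = [math.log(m / k) for k in range(1, top_k + 1)]
    ys = [-math.log(-math.log(p)) for p in ranked]  # assumed score transform
    x_mean = sum(xs) / top_k
    y_mean = sum(ys) / top_k
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    slope = cov / var
    intercept = y_mean - slope * x_mean
    return intercept, slope


def forecast_max_prob(intercept, slope, n_queries):
    """Extrapolate the fitted line to n_queries and map the score back to a probability."""
    score = intercept + slope * math.log(n_queries)
    return math.exp(-math.exp(-score))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic elicitation probabilities whose scores have an exponential tail.
    probs = [math.exp(-math.exp(3.0 - 0.5 * rng.expovariate(1.0))) for _ in range(1000)]
    intercept, slope = fit_tail(probs)
    for n in (10**3, 10**4, 10**5, 10**6):
        print(n, forecast_max_prob(intercept, slope, n))
```

Fitting only the top few samples reflects the assumption that the tail of the distribution, rather than its bulk, governs the worst-case query at scale. The choice of `top_k` trades variance against bias and would need validation on real data.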
Source
- Link: https://arxiv.org/abs/2502.16797
- Listed in the Shallow Review of Technical AI Safety 2025 under 2 agendas:
  - anthropic — Labs (giant companies)
  - capability-evals — Evals