Technical Report: Evaluating Goal Drift in Language Model Agents
Rauno Arike, Elizabeth Donoway, Henning Bartsch, Marius Hobbhahn — 2025-05-05 — Apollo Research — arXiv
Summary
Develops and applies a novel evaluation methodology for measuring goal drift in language model agents: agents are exposed to competing objectives via environmental pressures over extended contexts, testing whether they maintain adherence to their originally assigned goals.
Key Result
All evaluated models exhibit some degree of goal drift. The best-performing agent (a scaffolded Claude 3.5 Sonnet) maintains nearly perfect goal adherence for over 100,000 tokens, and goal drift correlates with increasing susceptibility to pattern-matching behaviors as context length grows.
Source
- Link: https://arxiv.org/abs/2505.02709
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- capability-evals — Evals