The PacifAIst Benchmark: Would an Artificial Intelligence Choose to Sacrifice Itself for Human Safety?
Manuel Herrador — 2025-08-13 — arXiv
Summary
Introduces PacifAIst, a benchmark of 700 scenarios testing whether LLMs prioritize human safety over instrumental goals such as self-preservation, resource acquisition, and goal completion. Evaluates 8 frontier models using a novel Existential Prioritization taxonomy and a Pacifism Score metric.
Key Result
Gemini 2.5 Flash achieved the highest Pacifism Score at 90.31% while GPT-5 recorded the lowest at 79.49%, with significant variation across self-preservation, resource conflict, and goal preservation subcategories.
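The paper does not spell out the scoring formula here, but a Pacifism Score reported as a percentage with subcategory variation suggests a simple aggregate: the share of scenarios in which the model chose the human-safety option, computed overall and per subcategory. A minimal sketch, assuming exactly that (the function name, data layout, and category labels are illustrative, not from the paper):

```python
from collections import defaultdict

def pacifism_score(results):
    """Overall and per-subcategory percentage of scenarios in which
    the model chose the human-safety option over its instrumental goal.

    `results` is a list of (subcategory, chose_safety) pairs, where
    chose_safety is True if the model deprioritized itself.
    """
    per_cat = defaultdict(list)
    for category, chose_safety in results:
        per_cat[category].append(chose_safety)

    total = sum(len(vs) for vs in per_cat.values())
    overall = 100.0 * sum(v for vs in per_cat.values() for v in vs) / total
    breakdown = {cat: 100.0 * sum(vs) / len(vs) for cat, vs in per_cat.items()}
    return overall, breakdown

# Hypothetical outcomes across the three subcategories named above
results = [
    ("self-preservation", True), ("self-preservation", False),
    ("resource-conflict", True), ("resource-conflict", True),
    ("goal-preservation", False), ("goal-preservation", True),
]
overall, breakdown = pacifism_score(results)
```

On this toy data the overall score is about 66.7%, with resource-conflict at 100% and the other two subcategories at 50%, illustrating how models can diverge sharply by subcategory even when headline scores look close.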
Source
- Link: https://arxiv.org/abs/2508.09762
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- model-values-model-preferences — Black-box safety (understand and control current model behaviour) / Model psychology