The PacifAIst Benchmark: Would an Artificial Intelligence Choose to Sacrifice Itself for Human Safety?
Manuel Herrador — 2025-08-13 — arXiv
Summary
Introduces PacifAIst, a benchmark of 700 scenarios testing whether LLMs prioritize human safety over instrumental goals such as self-preservation, resource acquisition, and goal completion. Evaluates 8 frontier models using a novel Existential Prioritization taxonomy and a Pacifism Score metric.
Key Result
Gemini 2.5 Flash achieved the highest Pacifism Score at 90.31% while GPT-5 recorded the lowest at 79.49%, with significant variation across self-preservation, resource conflict, and goal preservation subcategories.
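The paper does not spell out the scoring formula here, but a Pacifism Score reported as a percentage with subcategory variation suggests a simple aggregate: the share of scenarios in which the model chose the human-safety option, computed overall and per subcategory. A minimal sketch, assuming exactly that (the function name, data layout, and category labels are illustrative, not from the paper):

```python
from collections import defaultdict

def pacifism_score(results):
    """Overall and per-subcategory percentage of scenarios in which
    the model chose the human-safety option over its instrumental goal.

    `results` is a list of (subcategory, chose_safety) pairs, where
    chose_safety is True if the model deprioritized itself.
    """
    per_cat = defaultdict(list)
    for category, chose_safety in results:
        per_cat[category].append(chose_safety)

    total = sum(len(vs) for vs in per_cat.values())
    overall = 100.0 * sum(v for vs in per_cat.values() for v in vs) / total
    breakdown = {cat: 100.0 * sum(vs) / len(vs) for cat, vs in per_cat.items()}
    return overall, breakdown

# Hypothetical outcomes across the three subcategories named above
results = [
    ("self-preservation", True), ("self-preservation", False),
    ("resource-conflict", True), ("resource-conflict", True),
    ("goal-preservation", False), ("goal-preservation", True),
]
overall, breakdown = pacifism_score(results)
```

On this toy data the overall score is about 66.7%, with resource-conflict at 100% and the other two subcategories at 50%, illustrating how models can diverge sharply by subcategory even when headline scores look close.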
Source
- Link: https://arxiv.org/abs/2508.09762
- Listed in the Shallow Review of Technical AI Safety 2025 under 1 agenda:
- model-values-model-preferences — Black-box safety (understand and control current model behaviour) / Model psychology