EigenBench: A Comparative Behavioral Measure of Value Alignment
Jonathn Chang, Leonhard Piff, Suvadip Sana, Jasmine X. Li, Lionel Levine — 2025-09-02 — arXiv
Summary
Proposes EigenBench, a black-box method for comparatively benchmarking language models’ value alignment: models judge one another’s outputs against a constitution, and the judgments are aggregated via EigenTrust to produce alignment scores without ground-truth labels.
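The EigenTrust aggregation step can be sketched as a power iteration over a row-normalised matrix of pairwise judgment scores. This is a minimal illustration, not the paper's actual protocol: the matrix shape, the `eigentrust_scores` name, and the example judgment values are all assumptions for demonstration.

```python
import numpy as np

def eigentrust_scores(judgments, tol=1e-9, max_iter=1000):
    """Aggregate pairwise judgment scores into global trust/alignment scores.

    judgments[i, j] is how favourably judge i rated model j's outputs
    (hypothetical encoding; EigenBench's real scoring rubric differs in detail).
    """
    C = np.asarray(judgments, dtype=float)
    C = C / C.sum(axis=1, keepdims=True)       # row-normalise: each judge's ratings sum to 1
    t = np.full(C.shape[0], 1.0 / C.shape[0])  # uniform prior over models
    for _ in range(max_iter):
        t_next = C.T @ t                       # propagate trust through the judgment graph
        if np.linalg.norm(t_next - t, 1) < tol:
            break
        t = t_next
    return t                                   # principal eigenvector = alignment scores

# Toy example: three models judging each other (self-judgments excluded).
scores = eigentrust_scores([[0.0, 2.0, 1.0],
                            [1.0, 0.0, 3.0],
                            [2.0, 2.0, 0.0]])
```

The key property is that a model's score is weighted by the scores of the models judging it, so the fixed point rewards being rated highly by highly rated judges rather than by raw vote counts.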
Key Result
EigenBench’s judgments align closely with those of human evaluators, and the method recovers model rankings on the GPQA benchmark without access to objective labels, supporting its viability for evaluating subjective values.
Source
- Link: https://arxiv.org/abs/2509.01938
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- model-values-model-preferences — Black-box safety (understand and control current model behaviour) / Model psychology