EigenBench: A Comparative Behavioral Measure of Value Alignment
Jonathn Chang, Leonhard Piff, Suvadip Sana, Jasmine X. Li, Lionel Levine — 2025-09-02 — arXiv
Summary
Proposes EigenBench, a black-box method for comparatively benchmarking language models’ value alignment: models judge one another’s outputs against a constitution, and the judgments are aggregated via EigenTrust to produce alignment scores without ground-truth labels.
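The EigenTrust aggregation step can be sketched as a power iteration over a row-normalised matrix of pairwise judgment scores. This is a minimal illustration, not the paper's actual protocol: the matrix shape, the `eigentrust_scores` name, and the example judgment values are all assumptions for demonstration.

```python
import numpy as np

def eigentrust_scores(judgments, tol=1e-9, max_iter=1000):
    """Aggregate pairwise judgment scores into global trust/alignment scores.

    judgments[i, j] is how favourably judge i rated model j's outputs
    (hypothetical encoding; EigenBench's real scoring rubric differs in detail).
    """
    C = np.asarray(judgments, dtype=float)
    C = C / C.sum(axis=1, keepdims=True)       # row-normalise: each judge's ratings sum to 1
    t = np.full(C.shape[0], 1.0 / C.shape[0])  # uniform prior over models
    for _ in range(max_iter):
        t_next = C.T @ t                       # propagate trust through the judgment graph
        if np.linalg.norm(t_next - t, 1) < tol:
            break
        t = t_next
    return t                                   # principal eigenvector = alignment scores

# Toy example: three models judging each other (self-judgments excluded).
scores = eigentrust_scores([[0.0, 2.0, 1.0],
                            [1.0, 0.0, 3.0],
                            [2.0, 2.0, 0.0]])
```

The key property is that a model's score is weighted by the scores of the models judging it, so the fixed point rewards being rated highly by highly rated judges rather than by raw vote counts.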
Key Result
EigenBench’s judgments align closely with those of human evaluators, and the method recovers model rankings on the GPQA benchmark without access to objective labels, supporting its viability for evaluating subjective values.
Source
- Link: https://arxiv.org/abs/2509.01938
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- model-values-model-preferences — Black-box safety (understand and control current model behaviour) / Model psychology