EigenBench: A Comparative Behavioral Measure of Value Alignment

Jonathn Chang, Leonhard Piff, Suvadip Sana, Jasmine X. Li, Lionel Levine — 2025-09-02 — arXiv

Summary

Proposes EigenBench, a black-box method for comparatively benchmarking the value alignment of language models. The models judge each other’s outputs against a constitution, and the judgments are aggregated with EigenTrust into alignment scores, with no ground-truth labels required.
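
To make the aggregation step concrete, here is a minimal sketch of an EigenTrust-style computation in plain Python. It is an illustration under stated assumptions, not the paper's pipeline: the `votes` matrix is made-up data, the names `eigentrust`, `votes`, and `scores` are hypothetical, and raw preference counts stand in for whatever trust values the authors actually derive from model judgments. Each row is normalised so that every judge distributes one unit of trust, and power iteration then finds the trust vector that is stable under that redistribution.

```python
def eigentrust(trust, iters=1000, tol=1e-12):
    """Power iteration for the stationary trust vector of a trust matrix.

    trust[i][j] is how much judge i trusts model j (non-negative numbers).
    """
    n = len(trust)
    # Row-normalise: each judge hands out exactly one unit of trust.
    # A judge with no recorded trust falls back to a uniform row (assumption).
    rows = []
    for row in trust:
        total = sum(row)
        rows.append([x / total for x in row] if total > 0 else [1.0 / n] * n)

    t = [1.0 / n] * n  # start from uniform trust
    for _ in range(iters):
        # New score of model j = trust flowing in from every judge i,
        # weighted by judge i's current score.
        new = [sum(t[i] * rows[i][j] for i in range(n)) for j in range(n)]
        converged = max(abs(a - b) for a, b in zip(new, t)) < tol
        t = new
        if converged:
            break
    return t


# Hypothetical data: votes[i][j] counts how often judge model i preferred
# a response written by model j when judging against the constitution.
votes = [
    [5, 3, 2],
    [4, 4, 2],
    [6, 3, 1],
]
scores = eigentrust(votes)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

The scores sum to one, so they read as relative shares of trust rather than absolute measurements. A model's score counts for more when it is endorsed by judges that are themselves highly scored, which is the property that lets the aggregation work without external labels.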

Key Result

EigenBench’s judgments align closely with those of human evaluators, and the method recovers model rankings on the GPQA benchmark without access to objective labels, which supports its viability for evaluating subjective values.

Source