EigenBench: A Comparative Behavioral Measure of Value Alignment
Jonathn Chang, Leonhard Piff, Suvadip Sana, Jasmine Xinze Li, Lionel Levine
Abstract
Aligning AI with human values is a pressing unsolved problem. To address the lack of quantitative metrics for value alignment, we propose EigenBench: a black-box method for comparatively benchmarking language models’ values. Given an ensemble of models, a constitution describing a value system, and a dataset of scenarios, our method returns a vector of scores quantifying each model’s alignment to the given constitution. To produce these scores, each model judges the outputs of other models across many scenarios, and these judgments are aggregated with EigenTrust (Kamvar et al., 2003), yielding scores that reflect a weighted consensus judgment of the whole ensemble. EigenBench uses no ground truth labels, as it is designed to quantify subjective traits for which reasonable judges may disagree on the correct label. Hence, to validate our method, we collect human judgments on the same ensemble of models and show that EigenBench’s judgments align closely with those of human evaluators. We further demonstrate that EigenBench can recover model rankings on the GPQA benchmark without access to objective labels, supporting its viability as a framework for evaluating subjective values for which no ground truths exist. The code is available at https://github.com/jchang153/EigenBench.
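The aggregation step described in the abstract can be sketched as power iteration on a row-normalized judgment matrix, which is the core of EigenTrust. The function name `eigentrust_scores` and the uniform damping term are illustrative assumptions (the original EigenTrust paper uses a pre-trusted peer set for the analogous role); this is a minimal sketch, not the paper's implementation.

```python
import numpy as np

def eigentrust_scores(judgments, damping=0.0, iters=100, tol=1e-10):
    """Sketch of EigenTrust-style aggregation via power iteration.

    judgments[i][j] = how favorably judge model i rated model j's
    outputs (non-negative). Returns a score vector summing to 1.
    """
    J = np.asarray(judgments, dtype=float)
    n = J.shape[0]
    # Row-normalize so each judge distributes one unit of trust
    # (rows that are all zero stay zero).
    row_sums = J.sum(axis=1, keepdims=True)
    C = np.divide(J, row_sums, out=np.zeros_like(J), where=row_sums > 0)
    t = np.full(n, 1.0 / n)  # uniform prior over models
    for _ in range(iters):
        # Blend the trust-weighted judgment with a uniform prior
        # (damping plays the role of EigenTrust's pre-trusted peers).
        t_new = (1 - damping) * C.T @ t + damping / n
        if np.linalg.norm(t_new - t, 1) < tol:
            return t_new
        t = t_new
    return t
```

For example, `eigentrust_scores([[0, 2, 1], [1, 0, 1], [1, 1, 0]])` returns a length-3 score vector whose entries sum to 1, with higher scores for models rated more favorably by highly rated judges.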
EigenBench measures language models' value alignment by aggregating ensemble peer judgments with EigenTrust, requiring no ground-truth labels.
- Black-box method for benchmarking LLM values using peer model judgments and EigenTrust aggregation
- Framework for quantifying subjective traits where no ground truth exists
- Validation through human judgment comparison and objective ranking recovery on GPQA
- EigenTrust aggregation
- peer model evaluation
- ensemble methods
- GPQA benchmark
Limitations (from the paper)
- The data collection process is inefficient, requiring many model and comparison calls
- Limited examination of the GPQA result as an unsupervised evaluation method
Future directions (from the paper)
- Incorporate active learning, using occasional human judgments to guide sampling
- Dynamically train the BTD model on comparison combinations with higher loss values
- Further examine the GPQA result for unsupervised evaluation of long-horizon planning and other expensive-to-evaluate tasks
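The BTD (Bradley-Terry-Davidson) model referenced above extends the Bradley-Terry model with a tie parameter. As a hedged illustration of the underlying pairwise-comparison fit, here is a minimal Bradley-Terry fit using the standard MM (minorization-maximization) update; the tie parameter is omitted, and the function name `bradley_terry` is an assumption for this sketch, not the paper's code.

```python
def bradley_terry(wins, iters=200, tol=1e-9):
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] = number of times item i was preferred over item j.
    Uses the MM update p_i = W_i / sum_{j != i} n_ij / (p_i + p_j),
    where W_i is i's total wins and n_ij the comparisons of i and j.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        p_new = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins of item i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            p_new.append(w_i / denom if denom > 0 else p[i])
        s = sum(p_new)
        p_new = [x / s for x in p_new]  # fix the scale invariance
        if max(abs(a - b) for a, b in zip(p, p_new)) < tol:
            return p_new
        p = p_new
    return p
```

For two items with `wins = [[0, 8], [2, 0]]`, the maximum-likelihood win probability of item 0 is 8/10, so the normalized strengths converge to approximately `[0.8, 0.2]`.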
Author keywords
- value alignment
- Bradley-Terry model
- EigenTrust
- model disposition
- constitutional AI