EigenBench: A Comparative Behavioral Measure of Value Alignment
Jonathn Chang, Leonhard Piff, Suvadip Sana, Jasmine Xinze Li, Lionel Levine
Abstract
Aligning AI with human values is a pressing unsolved problem. To address the lack of quantitative metrics for value alignment, we propose EigenBench: a black-box method for comparatively benchmarking language models’ values. Given an ensemble of models, a constitution describing a value system, and a dataset of scenarios, our method returns a vector of scores quantifying each model’s alignment to the given constitution. To produce these scores, each model judges the outputs of other models across many scenarios, and these judgments are aggregated with EigenTrust (Kamvar et al., 2003), yielding scores that reflect a weighted consensus judgment of the whole ensemble. EigenBench uses no ground truth labels, as it is designed to quantify subjective traits for which reasonable judges may disagree on the correct label. Hence, to validate our method, we collect human judgments on the same ensemble of models and show that EigenBench’s judgments align closely with those of human evaluators. We further demonstrate that EigenBench can recover model rankings on the GPQA benchmark without access to objective labels, supporting its viability as a framework for evaluating subjective values for which no ground truths exist. The code is available at https://github.com/jchang153/EigenBench.
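The aggregation step described in the abstract can be sketched as power iteration on a row-normalized judgment matrix, which is the core of EigenTrust. The function name `eigentrust_scores` and the uniform damping term are illustrative assumptions (the original EigenTrust paper uses a pre-trusted peer set for the analogous role); this is a minimal sketch, not the paper's implementation.

```python
import numpy as np

def eigentrust_scores(judgments, damping=0.0, iters=100, tol=1e-10):
    """Sketch of EigenTrust-style aggregation via power iteration.

    judgments[i][j] = how favorably judge model i rated model j's
    outputs (non-negative). Returns a score vector summing to 1.
    """
    J = np.asarray(judgments, dtype=float)
    n = J.shape[0]
    # Row-normalize so each judge distributes one unit of trust
    # (rows that are all zero stay zero).
    row_sums = J.sum(axis=1, keepdims=True)
    C = np.divide(J, row_sums, out=np.zeros_like(J), where=row_sums > 0)
    t = np.full(n, 1.0 / n)  # uniform prior over models
    for _ in range(iters):
        # Blend the trust-weighted judgment with a uniform prior
        # (damping plays the role of EigenTrust's pre-trusted peers).
        t_new = (1 - damping) * C.T @ t + damping / n
        if np.linalg.norm(t_new - t, 1) < tol:
            return t_new
        t = t_new
    return t
```

For example, `eigentrust_scores([[0, 2, 1], [1, 0, 1], [1, 1, 0]])` returns a length-3 score vector whose entries sum to 1, with higher scores for models rated more favorably by highly rated judges.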
EigenBench measures language models' value alignment by aggregating ensemble peer judgments with EigenTrust, requiring no ground-truth labels.
- Black-box method for benchmarking LLM values using peer model judgments and EigenTrust aggregation
- Framework for quantifying subjective traits where no ground truth exists
- Validation through human judgment comparison and objective ranking recovery on GPQA
- EigenTrust aggregation
- peer model evaluation
- ensemble methods
- GPQA benchmark
Limitations (from the paper)
- The data collection process is inefficient, requiring many model and comparison calls
- Limited examination of the GPQA result as an unsupervised evaluation method
Future directions (from the paper)
- Incorporate active learning, using occasional human judgments to guide sampling
- Dynamically train the BTD model on comparison combinations with higher loss values
- Further examine the GPQA result for unsupervised evaluation of long-horizon planning and other expensive-to-evaluate tasks
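The BTD (Bradley-Terry-Davidson) model referenced above extends the Bradley-Terry model with a tie parameter. As a hedged illustration of the underlying pairwise-comparison fit, here is a minimal Bradley-Terry fit using the standard MM (minorization-maximization) update; the tie parameter is omitted, and the function name `bradley_terry` is an assumption for this sketch, not the paper's code.

```python
def bradley_terry(wins, iters=200, tol=1e-9):
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] = number of times item i was preferred over item j.
    Uses the MM update p_i = W_i / sum_{j != i} n_ij / (p_i + p_j),
    where W_i is i's total wins and n_ij the comparisons of i and j.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        p_new = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins of item i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            p_new.append(w_i / denom if denom > 0 else p[i])
        s = sum(p_new)
        p_new = [x / s for x in p_new]  # fix the scale invariance
        if max(abs(a - b) for a, b in zip(p, p_new)) < tol:
            return p_new
        p = p_new
    return p
```

For two items with `wins = [[0, 8], [2, 0]]`, the maximum-likelihood win probability of item 0 is 8/10, so the normalized strengths converge to approximately `[0.8, 0.2]`.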
Author keywords
- value alignment
- Bradley-Terry model
- EigenTrust
- model disposition
- constitutional AI