Datasets, Benchmarks & Evaluation

New datasets, benchmarks, evaluation protocols, and studies of existing benchmarks.

PhyWorldBench: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

PhyWorldBench evaluates text-to-video models on physics adherence across fundamental, composite, and anti-physics scenarios.

Avg rating: 5.50 (4–6) · Jing Gu et al.

CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale

CyberGym benchmarks AI agents on 1,507 real-world vulnerabilities, discovers 34 zero-days, and shows that top models achieve only 22% success on PoC generation.

Avg rating: 7.00 (6–8) · Zhun Wang et al.

EditBench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits

Introduces EditBench, a benchmark for real-world LLM code editing with 545 problems drawn from actual developer usage.

Avg rating: 7.50 (6–10) · Wayne Chi et al.

Monocular Normal Estimation via Shading Sequence Estimation

RoSE estimates surface normals via shading sequence prediction, addressing 3D misalignment in monocular normal estimation.

Avg rating: 6.40 (6–8) · Zongrui Li et al.

On the Wasserstein Geodesic Principal Component Analysis of probability measures

Geodesic PCA for probability distributions using Wasserstein geometry, with a neural network parametrization for continuous distributions.

Avg rating: 7.00 (4–10) · Nina Vesseron et al.

RealPDEBench: A Benchmark for Complex Physical Systems with Real-World Data

RealPDEBench is the first benchmark to integrate real-world measurements with paired simulations across five physical systems for scientific ML evaluation.

Avg rating: 7.50 (4–10) · Peiyan Hu et al.

TabStruct: Measuring Structural Fidelity of Tabular Data

The TabStruct benchmark evaluates tabular data generators on structural fidelity and conventional dimensions, using a global utility metric that does not require ground-truth causal structures.

Avg rating: 7.00 (4–10) · Xiangjian Jiang et al.

TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems

The TTSDS2 metric correlates robustly with human judgments for TTS evaluation, maintaining a Spearman correlation above 0.5 across diverse speech domains.

Avg rating: 5.33 (2–8) · Christoph Minixhofer et al.

World-In-World: World Models in a Closed-Loop World

Introduces a closed-loop benchmark that evaluates generative world models on embodied task performance rather than visual quality.

Avg rating: 7.00 (6–8) · Jiahan Zhang et al.