PhyWorldBench: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models
PhyWorldBench evaluates text-to-video models on physics adherence across fundamental, composite, and anti-physics scenarios.
New datasets, benchmarks, evaluation protocols, and studies of existing benchmarks.
CyberGym benchmarks AI agents on 1,507 real-world vulnerabilities, discovers 34 zero-days, and shows that top models achieve only 22% success on PoC generation.
Introduces EditBench, a benchmark for real-world LLM code editing with 545 problems drawn from actual developer usage.
RoSE estimates surface normals via shading sequence prediction, addressing 3D misalignment in monocular normal estimation.
Geodesic PCA for probability distributions using Wasserstein geometry, with a neural network parametrization for continuous distributions.
RealPDEBench is the first benchmark to integrate real-world measurements with paired simulations across five physical systems for scientific ML evaluation.
The TabStruct benchmark evaluates tabular data generators on structural fidelity and conventional dimensions, using a global utility metric without ground-truth causal structures.
The TTSDS2 metric correlates robustly with human judgments for TTS evaluation across diverse speech domains, maintaining a Spearman correlation above 0.5.
Introduces a closed-loop benchmark that evaluates generative world models on embodied task performance rather than visual quality.