PhyWorldBench: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models
Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, Xin Eric Wang
A large-scale, multidimensional benchmark for evaluating physical realism in text-to-video generation
Abstract
Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accurately simulate physical phenomena remains a critical and unresolved challenge. This paper presents PhyWorldBench, a comprehensive benchmark designed to evaluate video generation models based on their adherence to the laws of physics. The benchmark covers multiple levels of physical phenomena, ranging from fundamental principles like object motion and energy conservation to more complex scenarios involving rigid body interactions and human or animal motion. Additionally, we introduce a novel "Anti-Physics" category, where prompts intentionally violate real-world physics, enabling the assessment of whether models can follow such instructions while maintaining logical consistency. Besides large-scale human evaluation, we also design a simple yet effective method that uses current MLLMs to evaluate physics realism in a zero-shot fashion. We evaluate 10 state-of-the-art text-to-video generation models, including five open-source and five proprietary models, with a detailed comparison and analysis. Through systematic testing of their outputs across 1,050 curated prompts spanning fundamental, composite, and anti-physics scenarios, we identify pivotal challenges these models face in adhering to real-world physics. We then rigorously examine their performance on diverse physical phenomena with varying prompt types, deriving targeted recommendations for crafting prompts that enhance fidelity to physical principles.
PhyWorldBench evaluates text-to-video models on physics adherence across fundamental, composite, and anti-physics scenarios.
- Comprehensive benchmark with 1,050 curated prompts covering 10 physics phenomenon categories
- Zero-shot MLLM evaluation method for assessing physics realism without additional training
- Evaluation of 10 state-of-the-art models identifying challenges in physics adherence
- Anti-physics category enabling assessment of instruction following while maintaining consistency
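The zero-shot MLLM evaluation mentioned above can be pictured as prompting a multimodal judge with video frames and parsing its verdict. The prompt template, function names, and aggregation below are illustrative assumptions for a minimal sketch, not the paper's exact protocol.

```python
# Hypothetical sketch of a zero-shot MLLM physics judge.
# JUDGE_PROMPT, parse_judgment, and realism_rate are illustrative
# assumptions, not the benchmark's actual implementation.

JUDGE_PROMPT = (
    "You are shown frames from a video generated for the prompt: {prompt!r}.\n"
    "Answer YES or NO: does the video obey real-world physics "
    "(e.g., gravity, momentum, object permanence)?"
)

def parse_judgment(response: str) -> bool:
    """Map a free-form MLLM reply to a binary physics-realism verdict."""
    return response.strip().upper().startswith("YES")

def realism_rate(responses: list[str]) -> float:
    """Fraction of judged videos deemed physically plausible."""
    if not responses:
        return 0.0
    return sum(parse_judgment(r) for r in responses) / len(responses)
```

In practice, the parsed verdicts would come from an MLLM API call per video; only the parsing and aggregation steps are shown here because they are model-agnostic.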
- Video generation evaluation
- Multi-modal language models
- Human evaluation
- PhyWorldBench
- The 1,050 prompts may not fully capture specialized or highly intricate domain-specific interactions
- The CAP evaluator shows a slight preference for polished visual presentation, which can affect physics evaluation
- Expand the prompt collection to include additional niche phenomena
- Enhance the evaluation workflow to distinguish visual style from physical correctness
Author keywords
- Video Generation
- Video Evaluation
Related orals
On the Wasserstein Geodesic Principal Component Analysis of probability measures
Geodesic PCA for probability distributions using Wasserstein geometry with neural network parametrization for continuous distributions.
TabStruct: Measuring Structural Fidelity of Tabular Data
TabStruct benchmark evaluates tabular data generators on structural fidelity and conventional dimensions using global utility metric without ground-truth causal structures.
Monocular Normal Estimation via Shading Sequence Estimation
RoSE estimates surface normals via shading sequence prediction, addressing 3D misalignment in monocular normal estimation.
TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems
TTSDS2 metric robustly correlates with human judgments for TTS evaluation across diverse speech domains maintaining >0.5 Spearman correlation.
World-In-World: World Models in a Closed-Loop World
Introduces closed-loop benchmark evaluating generative world models on embodied task performance rather than visual quality.