World-In-World: World Models in a Closed-Loop World
Jiahan Zhang, Muqing Jiang, Nanru Dai, TaiMing Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M. Patel, Paul Pu Liang, Daniel Khashabi, Cheng Peng, Rama Chellappa, Tianmin Shu, Alan Yuille, Yilun Du, Jieneng Chen
By grounding assessment in embodied task success instead of video metrics, World-In-World provides a principled yardstick for future research on generative world models in the context of embodiment.
Abstract
Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., *do WMs actually help agents succeed at embodied tasks?* To address this gap, we introduce World-In-World, the first open platform that benchmarks WMs in a closed-loop setting that mirrors real agent-environment interactions. World-In-World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs to be used for decision making. We curate four closed-loop environments to rigorously evaluate diverse WMs, prioritizing task success as the primary metric and moving beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success, and controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance. By centering evaluation on closed-loop outcomes, World-In-World establishes a new benchmark for the systematic assessment of WMs.
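The abstract describes the unified online planning strategy, the standardized action API, and the role of inference-time compute only at a high level. Below is a minimal Python sketch of what such a closed-loop, world-model-based planning loop could look like; every name and interface here (`WorldModel`, `Environment`, `plan_step`, `run_episode`, the string action space, the scoring function) is an illustrative assumption, not the platform's actual API.

```python
"""Minimal sketch of closed-loop planning with a generative world model (WM).
All names and interfaces are illustrative assumptions, not the actual
World-In-World API."""
from __future__ import annotations

import random
from typing import Callable, List, Protocol, Tuple


class WorldModel(Protocol):
    def rollout(self, obs: object, actions: List[str]) -> List[object]:
        """Predict future observations conditioned on a candidate action sequence."""
        ...


class Environment(Protocol):
    def reset(self) -> object: ...

    def step(self, action: str) -> Tuple[object, bool]:
        """Execute one action; return the next observation and a task-success flag."""
        ...


def plan_step(
    wm: WorldModel,
    obs: object,
    action_space: List[str],
    score_fn: Callable[[List[object], List[str]], float],
    horizon: int = 4,
    num_candidates: int = 8,
) -> str:
    """Sample candidate action sequences, imagine each with the WM, and return
    the first action of the best-scoring sequence. Raising num_candidates is
    one way to spend more inference-time compute."""
    best_action, best_score = action_space[0], float("-inf")
    for _ in range(num_candidates):
        candidate = [random.choice(action_space) for _ in range(horizon)]
        imagined = wm.rollout(obs, candidate)   # predicted future observations
        score = score_fn(imagined, candidate)   # task-specific scoring / verifier
        if score > best_score:
            best_action, best_score = candidate[0], score
    return best_action


def run_episode(
    env: Environment,
    wm: WorldModel,
    action_space: List[str],
    score_fn: Callable[[List[object], List[str]], float],
    max_steps: int = 50,
) -> bool:
    """Closed loop: plan with the WM, act in the environment, then replan from
    the newly observed state. Task success, not visual quality, is the metric."""
    obs = env.reset()
    for _ in range(max_steps):
        action = plan_step(wm, obs, action_space, score_fn)
        obs, success = env.step(action)
        if success:
            return True
    return False
```

The design point mirrored in this sketch is that the return value of `run_episode` (task success), not the fidelity of the imagined rollouts, is what gets measured, and that `num_candidates` is a simple knob for allocating more inference-time compute.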
Introduces a closed-loop benchmark that evaluates generative world models on embodied task performance rather than visual quality.
- World-In-World, the first open platform benchmarking world models in a closed-loop setting that mirrors real agent-environment interactions
- Unified online planning strategy and standardized action API that let heterogeneous world models be used for decision making
- First data scaling law for world models in embodied settings, alongside the finding that visual quality alone does not guarantee task success (see the fitting sketch after this list)
- World models
- Video generation
- Diffusion models
- Planning
- Embodied AI
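The highlighted data scaling law is only named in this summary, not given in closed form. The following is a minimal sketch of how one could fit such a law to closed-loop success rates measured after post-training on increasing amounts of action-observation data; the saturating power-law form, the `success_rate` function, and every number are hypothetical placeholders, not results from the paper.

```python
"""Hypothetical sketch of fitting a data scaling law for world-model
post-training. The functional form and all numbers are assumptions,
not values reported by the paper."""
import numpy as np
from scipy.optimize import curve_fit


def success_rate(n_samples, a, b, c):
    # Assumed saturating power law: success approaches c as data grows.
    return c - a * np.power(n_samples, -b)


# Placeholder (post-training data size, closed-loop success rate) points.
n = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
sr = np.array([0.21, 0.29, 0.38, 0.44, 0.49])

(a, b, c), _ = curve_fit(success_rate, n, sr, p0=[1.0, 0.3, 0.6], maxfev=10_000)
print(f"fit: success ~= {c:.2f} - {a:.2f} * N^(-{b:.2f})")
print("extrapolated success at N=1e6:", round(float(success_rate(1e6, a, b, c)), 3))
```

In the paper's setting, the x-axis would be the amount of action-observation post-training data and the y-axis the closed-loop success rate; the sketch only illustrates the fitting procedure, not any specific curve the authors report.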
Authors did not state explicit limitations.
Future directions noted by the authors:
- Improving generalization through unified action representations and curriculum-based, domain-specific data collection (from the paper)
- Better encoding and retrieval of long-term dependencies via spatial and episode-level memory (from the paper)
- Physics-guided motion generation and physics-aware reinforcement post-training for precise dynamics modeling (from the paper)
- Development of stronger proposal and revision policies to improve both the performance floor and ceiling (from the paper)
- Investigation of more efficient world-model architectures, training/inference strategies, and distillation techniques (from the paper)
Author keywords
- world models
- video generation
- embodied AI
- generative models
Related orals
On the Wasserstein Geodesic Principal Component Analysis of probability measures
Geodesic PCA for probability distributions using Wasserstein geometry with neural network parametrization for continuous distributions.
TabStruct: Measuring Structural Fidelity of Tabular Data
TabStruct benchmark evaluates tabular data generators on structural fidelity and conventional dimensions using global utility metric without ground-truth causal structures.
Monocular Normal Estimation via Shading Sequence Estimation
RoSE estimates surface normals via shading sequence prediction, addressing 3D misalignment in monocular normal estimation.
TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems
TTSDS2 metric robustly correlates with human judgments for TTS evaluation across diverse speech domains maintaining >0.5 Spearman correlation.
EditBench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits
Introduces EditBench benchmark for real-world LLM code editing with 545 problems from actual developer usage.