FlashWorld: High-quality 3D Scene Generation within Seconds
Xinyang Li, Tengfei Wang, Zixiao Gu, Shengchuan Zhang, Chunchao Guo, Liujuan Cao
Abstract
We propose FlashWorld, a generative model that produces 3D scenes from a single image or text prompt in seconds, $10 \sim 100\times$ faster than previous works while achieving superior rendering quality. Our approach shifts from the conventional multi-view-oriented (MV-oriented) paradigm, which generates multi-view images for subsequent 3D reconstruction, to a 3D-oriented approach in which the model directly produces 3D Gaussian representations during multi-view generation. While ensuring 3D consistency, 3D-oriented methods typically suffer from poor visual quality. FlashWorld includes a dual-mode pre-training phase followed by a cross-mode post-training phase, effectively integrating the strengths of both paradigms. Specifically, leveraging the prior from a video diffusion model, we first pre-train a dual-mode multi-view diffusion model that jointly supports the MV-oriented and 3D-oriented generation modes. To bridge the quality gap in 3D-oriented generation, we further propose a cross-mode post-training distillation that matches the distribution of the consistent 3D-oriented mode to that of the high-quality MV-oriented mode. This not only enhances visual quality while maintaining 3D consistency, but also reduces the number of denoising steps required at inference. We also propose a strategy that leverages massive single-view images and text prompts during this process to enhance the model's generalization to out-of-distribution inputs. Extensive experiments demonstrate the superiority and efficiency of our method. Our code is released at https://github.com/imlixinyang/FlashWorld.
Proposes FlashWorld, which generates high-quality 3D scenes in seconds using a dual-mode diffusion model with cross-mode distillation.
- Shifts from the multi-view-oriented paradigm to a 3D-oriented one that directly produces 3D Gaussian representations during multi-view generation
- Dual-mode pre-training phase supporting both MV-oriented and 3D-oriented generation modes
- Cross-mode post-training distillation that matches the 3D-oriented mode to the high-quality MV-oriented distribution while maintaining 3D consistency
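The cross-mode distillation bullet above can be pictured as a distribution-matching update in the style of Distribution Matching Distillation: perturb a render from the 3D-oriented (student) mode, evaluate denoising scores under both modes, and use their difference as the direction that pulls the student toward the MV-oriented (teacher) distribution. The sketch below is an assumption-laden toy, not the paper's implementation; `cross_mode_dmd_grad` and both score functions are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_mode_dmd_grad(render, score_teacher, score_student, sigma=0.5):
    """Hypothetical DMD-style direction: noise the 3D-oriented render,
    then take the difference of the two modes' score estimates. Gradient
    descent along this direction moves the render toward the teacher
    (MV-oriented) distribution."""
    noisy = render + sigma * rng.standard_normal(render.shape)
    return score_student(noisy, sigma) - score_teacher(noisy, sigma)

# Toy Gaussian stand-ins: each mode's score points toward its own mean image.
teacher = lambda x, s: (0.8 - x) / s**2   # MV-oriented mode prefers brighter renders
student = lambda x, s: (0.2 - x) / s**2   # 3D-oriented mode currently renders darker
render = np.full((4, 4, 3), 0.2)          # current student render
g = cross_mode_dmd_grad(render, teacher, student)
# g is negative everywhere here, so descending along g raises pixel values,
# pulling the student render toward the teacher's mean.
```

In this linear-Gaussian toy the noise cancels exactly, so the direction is constant; in practice the score difference also carries per-pixel detail, which is what lets the distillation sharpen the 3D-oriented renders.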
- Diffusion models
- 3D Gaussian representations
- Multi-view generation
- Video diffusion models
- Knowledge distillation
Authors did not state explicit limitations.
Authors did not state explicit future directions.
Author keywords
- 3D Scene Generation
- Multi-view Diffusion Models
- World Models
- Distribution Matching Distillation