FlashWorld: High-quality 3D Scene Generation within Seconds
Xinyang Li, Tengfei Wang, Zixiao Gu, Shengchuan Zhang, Chunchao Guo, Liujuan Cao
Abstract
We propose FlashWorld, a generative model that produces 3D scenes from a single image or text prompt in seconds, $10 \sim 100\times$ faster than previous works while achieving superior rendering quality. Our approach shifts from the conventional multi-view-oriented (MV-oriented) paradigm, which generates multi-view images for subsequent 3D reconstruction, to a 3D-oriented approach in which the model directly produces 3D Gaussian representations during multi-view generation. While ensuring 3D consistency, 3D-oriented methods typically suffer from poor visual quality. FlashWorld includes a dual-mode pre-training phase followed by a cross-mode post-training phase, effectively integrating the strengths of both paradigms. Specifically, leveraging the prior from a video diffusion model, we first pre-train a dual-mode multi-view diffusion model that jointly supports the MV-oriented and 3D-oriented generation modes. To bridge the quality gap in 3D-oriented generation, we further propose a cross-mode post-training distillation that matches the distribution of the consistent 3D-oriented mode to that of the high-quality MV-oriented mode. This not only enhances visual quality while maintaining 3D consistency, but also reduces the number of denoising steps required at inference. We also propose a strategy that leverages massive single-view images and text prompts during this process to enhance the model's generalization to out-of-distribution inputs. Extensive experiments demonstrate the superiority and efficiency of our method. Our code is released at https://github.com/imlixinyang/FlashWorld.
Proposes FlashWorld, which generates high-quality 3D scenes in seconds using a dual-mode diffusion model with cross-mode distillation.
- Shifts from the multi-view-oriented paradigm to a 3D-oriented one that directly produces 3D Gaussian representations during multi-view generation
- Dual-mode pre-training phase supporting both MV-oriented and 3D-oriented generation modes
- Cross-mode post-training distillation that matches the 3D-oriented mode to the high-quality MV-oriented distribution while maintaining 3D consistency
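The cross-mode distillation bullet above can be pictured as a distribution-matching update in the style of Distribution Matching Distillation: perturb a render from the 3D-oriented (student) mode, evaluate denoising scores under both modes, and use their difference as the direction that pulls the student toward the MV-oriented (teacher) distribution. The sketch below is an assumption-laden toy, not the paper's implementation; `cross_mode_dmd_grad` and both score functions are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_mode_dmd_grad(render, score_teacher, score_student, sigma=0.5):
    """Hypothetical DMD-style direction: noise the 3D-oriented render,
    then take the difference of the two modes' score estimates. Gradient
    descent along this direction moves the render toward the teacher
    (MV-oriented) distribution."""
    noisy = render + sigma * rng.standard_normal(render.shape)
    return score_student(noisy, sigma) - score_teacher(noisy, sigma)

# Toy Gaussian stand-ins: each mode's score points toward its own mean image.
teacher = lambda x, s: (0.8 - x) / s**2   # MV-oriented mode prefers brighter renders
student = lambda x, s: (0.2 - x) / s**2   # 3D-oriented mode currently renders darker
render = np.full((4, 4, 3), 0.2)          # current student render
g = cross_mode_dmd_grad(render, teacher, student)
# g is negative everywhere here, so descending along g raises pixel values,
# pulling the student render toward the teacher's mean.
```

In this linear-Gaussian toy the noise cancels exactly, so the direction is constant; in practice the score difference also carries per-pixel detail, which is what lets the distillation sharpen the 3D-oriented renders.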
- Diffusion models
- 3D Gaussian representations
- Multi-view generation
- Video diffusion models
- Knowledge distillation
Authors did not state explicit limitations.
Authors did not state explicit future directions.
Author keywords
- 3D Scene Generation
- Multi-view Diffusion Models
- World Models
- Distribution Matching Distillation