Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
Hyojun Go, Dominik Narnhofer, Goutam Bhat, Prune Truong, Federico Tombari, Konrad Schindler
Text-to-3D scene generative modelling by unifying a video generative model with a foundational 3D model via model stitching and alignment.
Abstract
The rapid progress of large, pretrained models for both visual content generation and 3D reconstruction opens up new possibilities for text-to-3D generation. Intuitively, one could obtain a formidable 3D scene generator if one were able to combine the power of a modern latent text-to-video model as "generator" with the geometric abilities of a recent (feedforward) 3D reconstruction system as "decoder". We introduce **VIST3A**, a general framework that does just that, addressing two main challenges. First, the two components must be joined in a way that preserves the rich knowledge encoded in their weights. We revisit *model stitching*, i.e., we identify the layer in the 3D decoder that best matches the latent representation produced by the text-to-video generator and stitch the two parts together. That operation requires only a small dataset and no labels. Second, the text-to-video generator must be aligned with the stitched 3D decoder, to ensure that the generated latents are decodable into consistent, perceptually convincing 3D scene geometry. To that end, we adapt *direct reward finetuning*, a popular technique for human preference alignment. We evaluate the proposed VIST3A approach with different video generators and 3D reconstruction models. All tested pairings markedly improve over prior text-to-3D models that output Gaussian splats. Moreover, by choosing a suitable 3D base model, VIST3A also enables high-quality text-to-pointmap generation.
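The stitching step described above can be illustrated with a toy sketch. The idea (under simplifying assumptions not from the paper: toy random data, a purely linear stitch, and three hypothetical candidate layers) is to record the generator's latents and the 3D decoder's intermediate activations on a small unlabeled image set, fit a linear map from latents to each candidate layer, and pick the layer where the fit is best:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical shapes): latents from a video generator and
# activations recorded at several candidate layers of a 3D decoder,
# computed on the same small unlabeled image set.
n_samples, d_latent = 256, 32
z = rng.normal(size=(n_samples, d_latent))  # generator latents

# Simulate three candidate layers; layer 1 is constructed to be nearly a
# linear function of z, so it should emerge as the best stitching point.
W_true = rng.normal(size=(d_latent, 48))
layer_acts = {
    0: rng.normal(size=(n_samples, 64)),  # unrelated features
    1: z @ W_true + 0.01 * rng.normal(size=(n_samples, 48)),
    2: rng.normal(size=(n_samples, 16)),  # unrelated features
}

def stitching_error(z, h):
    """Residual of the best linear map z -> h (least squares)."""
    A, *_ = np.linalg.lstsq(z, h, rcond=None)
    return float(np.mean((z @ A - h) ** 2))

errors = {k: stitching_error(z, h) for k, h in layer_acts.items()}
best_layer = min(errors, key=errors.get)
print(best_layer)  # layer whose representation best matches the latents
```

The fitted linear map itself then serves as the stitch: the decoder is truncated at the chosen layer and fed the mapped latents, which is why the procedure needs only a small dataset and no labels.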
VIST3A stitches text-to-video models with 3D reconstruction systems and aligns them via reward finetuning for high-quality text-to-3D generation.
- Introduces a model stitching approach that integrates the generative abilities of video models with the 3D understanding of feedforward 3D reconstruction models
- Develops a reward-based finetuning strategy to align the latent-space video generator with the stitched 3D decoder
- Enables high-quality text-to-3D generation and extends to other outputs, such as pointmaps and depthmaps, depending on the 3D base model
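The reward-based alignment mentioned above can be sketched in miniature. This is not the paper's training loop; it is a minimal toy illustration of direct reward finetuning, where a reparameterized sample from the generator is scored by a differentiable reward and the reward gradient flows back into the generator's parameters (here, a 2-parameter "generator" and a quadratic reward with a known optimum standing in for "the latent decodes to a good 3D scene"):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy setup: the generator outputs x = theta + eps
# (reparameterized sampling), and we ascend the gradient of a
# differentiable reward evaluated on the sample.
target = np.array([2.0, -1.0])  # optimum of the toy reward
theta = np.zeros(2)             # generator parameters to finetune
lr = 0.1

def reward(x):
    return -float(np.sum((x - target) ** 2))

for step in range(200):
    eps = 0.1 * rng.normal(size=2)
    x = theta + eps                   # reparameterized sample
    grad_theta = -2.0 * (x - target)  # d(reward)/d(theta); dx/dtheta = I
    theta += lr * grad_theta          # gradient ascent on the reward

# theta now sits near the reward optimum despite the sampling noise.
```

In VIST3A the reward instead scores the 3D output produced by pushing a generated latent through the stitched decoder, so the same gradient-ascent principle steers the video generator toward latents that decode into consistent geometry.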
- Model stitching
- Reward finetuning
- Diffusion models
Limitations
- The stitched model inherits its encoder from a video generation model designed for sequential, temporally coherent input, which limits performance on the unordered image sets typical of multi-view datasets.
- From the paper: "Input images must be arranged in a coherent sequence simulating smooth view transitions for the encoder to operate effectively."
Authors did not state explicit future directions.
Author keywords
- Text-to-3D generation
- Video Diffusion Model
- 3D Gaussian Splatting
- Generation
Related orals
Improving Diffusion Models for Class-imbalanced Training Data via Capacity Manipulation
Capacity manipulation improves diffusion models' handling of class-imbalanced data by reserving capacity for minority classes via low-rank decomposition.
Depth Anything 3: Recovering the Visual Space from Any Views
DA3 predicts spatially consistent 3D geometry from arbitrary camera views using plain transformer and depth-ray targets.
Radiometrically Consistent Gaussian Surfels for Inverse Rendering
RadioGS introduces radiometric consistency supervision for inverse rendering to accurately model indirect illumination in Gaussian-based representations.
True Self-Supervised Novel View Synthesis is Transferable
Presents XFactor, the first geometry-free self-supervised model for transferable novel view synthesis without 3D inductive biases.
Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation
Introduces parallel decoding for autoregressive image generation with flexible ordering, achieving a 3.4x latency reduction.