Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
Hyojun Go, Dominik Narnhofer, Goutam Bhat, Prune Truong, Federico Tombari, Konrad Schindler
Text-to-3D scene generative modelling by unifying a video generative model with a foundational 3D model via model stitching and alignment.
Abstract
The rapid progress of large, pretrained models for both visual content generation and 3D reconstruction opens up new possibilities for text-to-3D generation. Intuitively, one could obtain a formidable 3D scene generator if one were able to combine the power of a modern latent text-to-video model as "generator" with the geometric abilities of a recent (feedforward) 3D reconstruction system as "decoder". We introduce **VIST3A**, a general framework that does just that, addressing two main challenges. First, the two components must be joined in a way that preserves the rich knowledge encoded in their weights. We revisit *model stitching*, i.e., we identify the layer in the 3D decoder that best matches the latent representation produced by the text-to-video generator and stitch the two parts together. That operation requires only a small dataset and no labels. Second, the text-to-video generator must be aligned with the stitched 3D decoder, to ensure that the generated latents are decodable into consistent, perceptually convincing 3D scene geometry. To that end, we adapt *direct reward finetuning*, a popular technique for human preference alignment. We evaluate the proposed VIST3A approach with different video generators and 3D reconstruction models. All tested pairings markedly improve over prior text-to-3D models that output Gaussian splats. Moreover, by choosing a suitable 3D base model, VIST3A also enables high-quality text-to-pointmap generation.
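The stitching step described above can be illustrated with a toy sketch. The idea (under simplifying assumptions not from the paper: toy random data, a purely linear stitch, and three hypothetical candidate layers) is to record the generator's latents and the 3D decoder's intermediate activations on a small unlabeled image set, fit a linear map from latents to each candidate layer, and pick the layer where the fit is best:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical shapes): latents from a video generator and
# activations recorded at several candidate layers of a 3D decoder,
# computed on the same small unlabeled image set.
n_samples, d_latent = 256, 32
z = rng.normal(size=(n_samples, d_latent))  # generator latents

# Simulate three candidate layers; layer 1 is constructed to be nearly a
# linear function of z, so it should emerge as the best stitching point.
W_true = rng.normal(size=(d_latent, 48))
layer_acts = {
    0: rng.normal(size=(n_samples, 64)),  # unrelated features
    1: z @ W_true + 0.01 * rng.normal(size=(n_samples, 48)),
    2: rng.normal(size=(n_samples, 16)),  # unrelated features
}

def stitching_error(z, h):
    """Residual of the best linear map z -> h (least squares)."""
    A, *_ = np.linalg.lstsq(z, h, rcond=None)
    return float(np.mean((z @ A - h) ** 2))

errors = {k: stitching_error(z, h) for k, h in layer_acts.items()}
best_layer = min(errors, key=errors.get)
print(best_layer)  # layer whose representation best matches the latents
```

The fitted linear map itself then serves as the stitch: the decoder is truncated at the chosen layer and fed the mapped latents, which is why the procedure needs only a small dataset and no labels.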
VIST3A stitches text-to-video models with 3D reconstruction systems and aligns them via reward finetuning for high-quality text-to-3D generation.
- Introduces a model stitching approach that integrates the generative abilities of video models with the 3D understanding of feedforward 3D reconstruction models
- Develops a reward-based finetuning strategy to align the latent-space video generator with the stitched 3D decoder
- Enables high-quality text-to-3D generation and extends to other outputs, such as pointmaps and depthmaps, depending on the 3D base model
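The reward-based alignment mentioned above can be sketched in miniature. This is not the paper's training loop; it is a minimal toy illustration of direct reward finetuning, where a reparameterized sample from the generator is scored by a differentiable reward and the reward gradient flows back into the generator's parameters (here, a 2-parameter "generator" and a quadratic reward with a known optimum standing in for "the latent decodes to a good 3D scene"):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy setup: the generator outputs x = theta + eps
# (reparameterized sampling), and we ascend the gradient of a
# differentiable reward evaluated on the sample.
target = np.array([2.0, -1.0])  # optimum of the toy reward
theta = np.zeros(2)             # generator parameters to finetune
lr = 0.1

def reward(x):
    return -float(np.sum((x - target) ** 2))

for step in range(200):
    eps = 0.1 * rng.normal(size=2)
    x = theta + eps                   # reparameterized sample
    grad_theta = -2.0 * (x - target)  # d(reward)/d(theta); dx/dtheta = I
    theta += lr * grad_theta          # gradient ascent on the reward

# theta now sits near the reward optimum despite the sampling noise.
```

In VIST3A the reward instead scores the 3D output produced by pushing a generated latent through the stitched decoder, so the same gradient-ascent principle steers the video generator toward latents that decode into consistent geometry.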
- Model stitching
- Reward finetuning
- Diffusion models
Limitations
- The stitched model inherits its encoder from a video generation model designed for sequential, temporally coherent input, which limits performance on the unordered image sets typical of multi-view datasets.
- From the paper: "Input images must be arranged in a coherent sequence simulating smooth view transitions for the encoder to operate effectively."
Authors did not state explicit future directions.
Author keywords
- Text-to-3D generation
- Video Diffusion Model
- 3D Gaussian Splatting
- Generation
Related orals
Improving Diffusion Models for Class-imbalanced Training Data via Capacity Manipulation
Capacity manipulation improves diffusion models' handling of class-imbalanced data by reserving capacity for minority classes via low-rank decomposition.
Depth Anything 3: Recovering the Visual Space from Any Views
DA3 predicts spatially consistent 3D geometry from arbitrary camera views using plain transformer and depth-ray targets.
Radiometrically Consistent Gaussian Surfels for Inverse Rendering
RadioGS introduces radiometric consistency supervision for inverse rendering to accurately model indirect illumination in Gaussian-based representations.
True Self-Supervised Novel View Synthesis is Transferable
Presents XFactor, the first geometry-free self-supervised model for transferable novel view synthesis without 3D inductive biases.
Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation
Introduces parallel decoding for autoregressive image generation with flexible ordering, achieving a 3.4x latency reduction.