True Self-Supervised Novel View Synthesis is Transferable
Thomas Mitchel, Hyunwoo Ryu, Vincent Sitzmann
The key criterion for determining whether a model is capable of NVS is transferability, and we present the first fully geometry-free and self-supervised model capable of it.
Abstract
In this paper, we identify that the key criterion for determining whether a model is truly capable of novel view synthesis (NVS) is transferability: whether a pose representation extracted from one video sequence can be used to re-render the same camera trajectory in another. We analyze prior self-supervised NVS models and find that their predicted poses do not transfer: the same set of poses leads to different camera trajectories in different 3D scenes. Here, we present XFactor, the first geometry-free self-supervised model capable of true NVS. XFactor combines pair-wise pose estimation with a simple augmentation scheme of the inputs and outputs that jointly enables disentangling camera pose from scene content and facilitates geometric reasoning. Remarkably, we show that XFactor achieves transferability with unconstrained latent pose variables, without any 3D inductive biases or concepts from multi-view geometry, such as an explicit parameterization of poses as elements of SE(3). We introduce a new metric to quantify transferability, and through large-scale experiments, we demonstrate that XFactor significantly outperforms prior pose-free NVS transformers and show through probing experiments that its latent poses are highly correlated with real-world poses.
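The transferability criterion from the abstract can be illustrated with a toy protocol: extract latent poses from one sequence, replay them in a scene with a different origin, and check that the relative camera motion is reproduced. This is a hedged sketch only; the function names (encode_poses, render, trajectory_error) and the additive latent model are illustrative stand-ins, not the paper's actual architecture or metric.

```python
import numpy as np

def encode_poses(frames):
    # Stand-in pair-wise latent-pose encoder: here, simply frame-to-frame
    # differences. A transferable encoder must depend only on relative motion.
    return [b - a for a, b in zip(frames[:-1], frames[1:])]

def render(scene_origin, latents):
    # Stand-in renderer: replays the latent poses starting from another
    # scene's origin, producing a new camera trajectory.
    traj = [scene_origin]
    for z in latents:
        traj.append(traj[-1] + z)
    return traj

def trajectory_error(traj_a, traj_b):
    # Compare relative motion between trajectories (origin-invariant),
    # as a transferability metric would.
    rel_a = np.diff(np.asarray(traj_a), axis=0)
    rel_b = np.diff(np.asarray(traj_b), axis=0)
    return float(np.abs(rel_a - rel_b).max())

# Camera positions observed in scene A ...
frames_a = [np.array([0.0, 0.0]), np.array([1.0, 0.5]), np.array([2.0, 0.0])]
latents = encode_poses(frames_a)
# ... replayed in scene B from a different starting point.
traj_b = render(np.array([10.0, -3.0]), latents)
print(trajectory_error(frames_a, traj_b))  # 0.0 for a transferable encoder
```

A non-transferable model (e.g., one whose latents entangle scene content with pose) would yield a nonzero error under the same check.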
Presents XFactor, the first geometry-free self-supervised model for transferable novel view synthesis without 3D inductive biases.
- Demonstrates transferability as the key criterion for true novel view synthesis capability
- Achieves latent pose learning without explicit 3D parameterization or multi-view geometry concepts
- New metric quantifying transferability enables large-scale transfer evaluation
- Self-supervised learning
- Pose estimation
- Novel view synthesis
- Latent variable models
- Augmentation strategies
Limitations (from the paper)
- POSEENC's restriction to a stereo model precludes ultra-wide-baseline pose estimation in a single forward pass
- Rendering quality exhibits blurring and warping artifacts that increase as target poses diverge from the context
- The model is deterministic rather than generative, limiting its ability to resolve uncertainty
Future directions (from the paper)
- Integrate recent advances in camera-controllable generative models to address rendering artifacts
Author keywords
- Novel View Synthesis
- Self-Supervised
- Unsupervised
- Representation Learning
Related orals
Improving Diffusion Models for Class-imbalanced Training Data via Capacity Manipulation
Capacity manipulation improves diffusion models' handling of class-imbalanced data by reserving capacity for minority classes via low-rank decomposition.
Depth Anything 3: Recovering the Visual Space from Any Views
DA3 predicts spatially consistent 3D geometry from arbitrary camera views using plain transformer and depth-ray targets.
Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
VIST3A stitches text-to-video models with 3D reconstruction systems and aligns them via reward finetuning for high-quality text-to-3D generation.
Radiometrically Consistent Gaussian Surfels for Inverse Rendering
RadioGS introduces radiometric consistency supervision for inverse rendering to accurately model indirect illumination in Gaussian-based representations.
Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation
Introduces parallel decoding for autoregressive image generation with flexible ordering achieving 3.4x latency reduction.