Depth Anything 3: Recovering the Visual Space from Any Views
Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, Bingyi Kang
Depth Anything 3 uses a single vanilla DINOv2 transformer that takes an arbitrary number of input views and outputs consistent depth and ray maps, delivering leading pose, geometry, and visual rendering performance.
Abstract
We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINOv2 encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 35.7% in camera pose accuracy and 23.6% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
DA3 predicts spatially consistent 3D geometry from arbitrary camera views using a plain transformer and a depth-ray prediction target.
- Plain transformer backbone without architectural specialization is sufficient for any-view geometry prediction
- Depth-ray prediction target eliminates need for complex multi-task learning
- Teacher-student training achieves detail and generalization on par with Depth Anything 2 while enabling state-of-the-art performance
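The depth-ray target pairs a per-pixel depth map with a per-pixel ray map (ray origins and directions in a shared frame), so consistent 3D geometry falls out of a single prediction head. A minimal sketch of how such outputs combine into a point map; the function name and array layout are illustrative assumptions, not DA3's actual code:

```python
import numpy as np

def unproject_depth_ray(depth, ray_origins, ray_dirs):
    """Lift per-pixel depth to 3D points along predicted rays.

    depth:       (H, W)    depth of each pixel along its ray
    ray_origins: (H, W, 3) per-pixel ray origins (camera center map)
    ray_dirs:    (H, W, 3) unit ray directions in a shared world frame
    Returns a (H, W, 3) world-space point map.
    """
    # Each 3D point lies at origin + depth * direction along its ray.
    return ray_origins + depth[..., None] * ray_dirs
```

Because depth and rays share one frame across views, point maps from different views can be compared directly, which is what makes the single target sufficient without extra multi-task heads.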
- Transformer architecture
- Teacher-student training
- Depth prediction
- Camera pose estimation
- KITTI
- Matterport3D
- Objaverse
- Virtual KITTI 2
The authors did not state explicit limitations.
- Extend reasoning to dynamic scenes (from the paper)
- Integrate language and interaction cues (from the paper)
- Explore larger-scale pretraining to close the loop between geometry understanding and actionable world models (from the paper)
Author keywords
- Depth Estimation
Related orals
Improving Diffusion Models for Class-imbalanced Training Data via Capacity Manipulation
Capacity manipulation improves diffusion models' handling of class-imbalanced data by reserving capacity for minority classes via low-rank decomposition.
Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
VIST3A stitches text-to-video models with 3D reconstruction systems and aligns them via reward finetuning for high-quality text-to-3D generation.
Radiometrically Consistent Gaussian Surfels for Inverse Rendering
RadioGS introduces radiometric consistency supervision for inverse rendering to accurately model indirect illumination in Gaussian-based representations.
True Self-Supervised Novel View Synthesis is Transferable
Presents XFactor, the first geometry-free self-supervised model for transferable novel view synthesis without 3D inductive biases.
Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation
Introduces parallel decoding for autoregressive image generation with flexible ordering, achieving a 3.4x latency reduction.