Depth Anything 3: Recovering the Visual Space from Any Views
Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, Bingyi Kang
Depth Anything 3 uses a single vanilla DINOv2 transformer that takes an arbitrary number of input views and outputs consistent depth and ray maps, delivering leading pose, geometry, and visual rendering performance.
Abstract
We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINOv2 encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 35.7% in camera pose accuracy and 23.6% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
DA3 predicts spatially consistent 3D geometry from arbitrary camera views using a plain transformer and a depth-ray prediction target.
- Plain transformer backbone without architectural specialization is sufficient for any-view geometry prediction
- Depth-ray prediction target eliminates need for complex multi-task learning
- Teacher-student training achieves detail and generalization on par with Depth Anything 2 while enabling state-of-the-art performance
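The depth-ray target pairs a per-pixel depth map with a per-pixel ray map (ray origins and directions in a shared frame), so consistent 3D geometry falls out of a single prediction head. A minimal sketch of how such outputs combine into a point map; the function name and array layout are illustrative assumptions, not DA3's actual code:

```python
import numpy as np

def unproject_depth_ray(depth, ray_origins, ray_dirs):
    """Lift per-pixel depth to 3D points along predicted rays.

    depth:       (H, W)    depth of each pixel along its ray
    ray_origins: (H, W, 3) per-pixel ray origins (camera center map)
    ray_dirs:    (H, W, 3) unit ray directions in a shared world frame
    Returns a (H, W, 3) world-space point map.
    """
    # Each 3D point lies at origin + depth * direction along its ray.
    return ray_origins + depth[..., None] * ray_dirs
```

Because depth and rays share one frame across views, point maps from different views can be compared directly, which is what makes the single target sufficient without extra multi-task heads.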
- Transformer architecture
- Teacher-student training
- Depth prediction
- Camera pose estimation
- KITTI
- Matterport3D
- Objaverse
- Virtual KITTI 2
The authors did not state explicit limitations.
- Extend reasoning to dynamic scenes (from the paper)
- Integrate language and interaction cues (from the paper)
- Explore larger-scale pretraining to close the loop between geometry understanding and actionable world models (from the paper)
Author keywords
- Depth Estimation
Related orals
Improving Diffusion Models for Class-imbalanced Training Data via Capacity Manipulation
Capacity manipulation improves diffusion models' handling of class-imbalanced data by reserving capacity for minority classes via low-rank decomposition.
Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
VIST3A stitches text-to-video models with 3D reconstruction systems and aligns them via reward finetuning for high-quality text-to-3D generation.
Radiometrically Consistent Gaussian Surfels for Inverse Rendering
RadioGS introduces radiometric consistency supervision for inverse rendering to accurately model indirect illumination in Gaussian-based representations.
True Self-Supervised Novel View Synthesis is Transferable
Presents XFactor, the first geometry-free self-supervised model for transferable novel view synthesis without 3D inductive biases.
Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation
Introduces parallel decoding for autoregressive image generation with flexible ordering, achieving a 3.4x latency reduction.