ICLR 2026 Orals

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, Bingyi Kang

Vision & 3D · Thu, Apr 23 · 3:27 PM–3:37 PM · 204 A/B · Avg rating: 7.00 (6–8)
Author-provided TL;DR

Depth Anything 3 uses a single vanilla DINOv2 transformer to take arbitrary input views and output consistent depth and ray maps, delivering leading pose, geometry, and visual rendering performance.

Abstract

We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., a vanilla DINOv2 encoder) is sufficient as a backbone without architectural specialization, and a single depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry, and visual rendering. On this benchmark, DA3 sets a new state of the art across all tasks, surpassing the prior SOTA, VGGT, by an average of 35.7% in camera pose accuracy and 23.6% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
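
For intuition, here is a minimal sketch of the interface the abstract describes: patch tokens from all input views pass through one plain transformer, and a single head decodes per-pixel depth and a 6-channel ray map for every view. Everything here is an assumption for illustration; `AnyViewModel`, the shapes, and the use of `nn.TransformerEncoder` as a stand-in for the DINOv2 encoder are hypothetical, not the authors' code.

```python
import torch
import torch.nn as nn


class AnyViewModel(nn.Module):
    """Hypothetical sketch: one plain transformer over the patch tokens of
    N views, with a single linear head predicting depth (1 channel) and a
    ray map (6 channels: origin + direction) per pixel."""

    def __init__(self, dim=1024, layers=24, heads=16, patch=14):
        super().__init__()
        self.patch = patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)
        self.head = nn.Linear(dim, patch * patch * 7)  # 1 depth + 6 ray

    def forward(self, views):                        # (B, N, 3, H, W)
        B, N, _, H, W = views.shape
        x = self.embed(views.flatten(0, 1))          # (B*N, dim, h, w)
        h, w = x.shape[-2:]
        x = x.flatten(2).transpose(1, 2)             # (B*N, h*w, dim)
        x = x.reshape(B, N * h * w, -1)              # all views attend jointly
        x = self.encoder(x)
        x = self.head(x).reshape(B, N, h, w, self.patch, self.patch, 7)
        x = x.permute(0, 1, 6, 2, 4, 3, 5).reshape(B, N, 7, H, W)
        return x[:, :, :1], x[:, :, 1:]              # depth, ray map
```

A quick smoke test, `AnyViewModel(dim=256, layers=2)(torch.randn(1, 4, 3, 140, 140))`, returns a `(1, 4, 1, 140, 140)` depth map and a `(1, 4, 6, 140, 140)` ray map. The point of the sketch is that nothing view-specific is baked into the architecture: cross-view consistency comes only from joint attention over all views' tokens.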

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

DA3 predicts spatially consistent 3D geometry from arbitrary camera views using a plain transformer and a depth-ray prediction target.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • A plain transformer backbone without architectural specialization is sufficient for any-view geometry prediction
  • A depth-ray prediction target eliminates the need for complex multi-task learning (a ray-map construction is sketched after this list)
  • Teacher-student training achieves detail and generalization on par with Depth Anything 2 while enabling state-of-the-art performance
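
To make the depth-ray target concrete, the sketch below builds the per-pixel ray map induced by a camera with intrinsics `K` and world-to-camera pose `(R, t)`; a ray map plus per-pixel depth jointly encodes scene geometry and camera pose. The function `ray_map` and this particular origin-plus-direction parameterization are assumptions for illustration; the paper's exact encoding may differ.

```python
import torch


def ray_map(K, R, t, H, W):
    """Return a (6, H, W) ray map: world-space origin (3) + unit direction (3).

    K: (3, 3) intrinsics; R: (3, 3), t: (3,) world-to-camera extrinsics.
    """
    # pixel-center homogeneous coordinates
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32) + 0.5,
                          torch.arange(W, dtype=torch.float32) + 0.5,
                          indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0)   # (3, H, W)
    # back-project pixels to camera-space directions, rotate into world frame
    dirs = torch.linalg.inv(K) @ pix.reshape(3, -1)        # (3, H*W)
    dirs = R.T @ dirs
    dirs = dirs / dirs.norm(dim=0, keepdim=True)
    # all rays of one view share the camera center as their origin
    origin = (-R.T @ t).reshape(3, 1).expand_as(dirs)
    return torch.cat([origin, dirs], dim=0).reshape(6, H, W)


# Example: identity pose, simple pinhole intrinsics
K = torch.tensor([[100., 0., 32.], [0., 100., 24.], [0., 0., 1.]])
rays = ray_map(K, torch.eye(3), torch.zeros(3), H=48, W=64)  # (6, 48, 64)
```

Given a predicted distance `d` along a pixel's ray, the 3D point is `origin + d * direction`, which is why a single depth-ray pair can replace separate depth, point-map, and pose heads.
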
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Transformer architecture
  • Teacher-student training
  • Depth prediction
  • Camera pose estimation
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • KITTI
  • Matterport3D
  • Objaverse
  • Virtual KITTI 2
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

The authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Extend reasoning to dynamic scenes
  • Integrate language and interaction cues
  • Explore larger-scale pretraining to close the loop between geometry understanding and actionable world models

Author keywords

  • Depth Estimation
