ICLR 2026 Orals

Monocular Normal Estimation via Shading Sequence Estimation

Zongrui Li, Xinhua Ma, Minghui Hu, Yunqing Zhao, Yingchen Yu, Qian Zheng, Chang Liu, Xudong Jiang, Song Bai

Datasets, Benchmarks & Evaluation Thu, Apr 23 · 3:51 PM–4:01 PM · 204 A/B Avg rating: 6.40 (6–8)

Abstract

Monocular normal estimation aims to recover the normal map of an object from a single RGB image captured under arbitrary lighting. Existing methods rely on deep models to directly predict normal maps. However, they often suffer from 3D misalignment: the estimated normal maps may look visually plausible, yet the reconstructed surfaces fail to align with the true 3D geometry. We argue that this misalignment stems from the current paradigm: the model struggles to distinguish and estimate varying geometry from normal maps, because differences in the underlying geometry surface only as relatively subtle color variations. To address this issue, we propose a new paradigm that reformulates normal estimation as shading sequence estimation, since shading sequences are more sensitive to variations in the underlying geometry. By learning to infer the shading sequence of an object, the model can better capture the underlying 3D geometry and thereby produce more accurate normal predictions. Building on this paradigm, we present RoSE, a method that leverages image-to-video generative models to predict shading sequences, which are then converted into normal maps by solving a simple ordinary least-squares problem. To enhance robustness and better handle complex objects, RoSE is trained on a synthetic dataset, MultiShade, with diverse shapes, materials, and lighting conditions. Experiments demonstrate that RoSE achieves state-of-the-art performance on both synthetic and real-world benchmark datasets for object-based monocular normal estimation.
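
The abstract's final step, converting a shading sequence into a normal map by solving an ordinary least-squares problem, matches the classic Lambertian setup in which each shading frame satisfies s_t ≈ l_t · n for a known light direction l_t. Below is a minimal Python sketch under that assumption; the Lambertian model, the function name, and the array layout are illustrative choices here, not the paper's exact formulation.

    import numpy as np

    def normals_from_shadings(shadings, lights):
        """shadings: (T, H, W) shading sequence; lights: (T, 3) unit light directions.
        Returns an (H, W, 3) unit-normal map via per-pixel ordinary least squares."""
        T, H, W = shadings.shape
        S = shadings.reshape(T, -1)                     # (T, P): one column per pixel
        # OLS: for each pixel p, find n_p minimizing ||lights @ n_p - S[:, p]||^2.
        N, *_ = np.linalg.lstsq(lights, S, rcond=None)  # (3, P) pseudo-normals
        N /= np.clip(np.linalg.norm(N, axis=0, keepdims=True), 1e-8, None)
        return N.T.reshape(H, W, 3)

    # Toy check: a flat patch with normal [0, 0, 1] under 8 upper-hemisphere lights.
    rng = np.random.default_rng(0)
    lights = rng.normal(size=(8, 3))
    lights[:, 2] = np.abs(lights[:, 2])                 # keep lights above the surface
    lights /= np.linalg.norm(lights, axis=1, keepdims=True)
    shadings = (lights @ np.array([0.0, 0.0, 1.0])).reshape(8, 1, 1) * np.ones((8, 4, 4))
    print(normals_from_shadings(shadings, lights)[0, 0])  # ~ [0. 0. 1.]

With at least three non-coplanar lights the per-pixel system is well posed; real shadings add shadows and non-Lambertian effects, which is where a learned model such as RoSE comes in.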

One-sentence summary·Auto-generated by claude-haiku-4-5-20251001

RoSE estimates surface normals via shading sequence prediction, addressing 3D misalignment in monocular normal estimation.

Contributions·Auto-generated by claude-haiku-4-5-20251001
  • New paradigm reformulating monocular normal estimation as shading sequence estimation for better geometry capture
  • Leverages image-to-video generative models to predict shading sequences, which are converted to normals via an OLS solver
  • Trains on the MultiShade synthetic dataset with diverse shapes, materials, and lighting conditions for robustness
Methods used·Auto-generated by claude-haiku-4-5-20251001
  • Diffusion models
  • Image-to-video generation
  • Ordinary least squares solving
  • Synthetic dataset training
Datasets used·Auto-generated by claude-haiku-4-5-20251001
  • MultiShade
  • DiLiGenT
  • LUCES
Limitations (author-stated)·Auto-generated by claude-haiku-4-5-20251001
  • Video diffusion models introduce computational overhead, limiting real-time applicability
  • Struggles under extreme lighting conditions with large regions of insufficient illumination
  • Fails on transparent or semi-transparent objects
  • Primary evaluation is object-centric; scene-centric extension remains open
Future work (author-stated)·Auto-generated by claude-haiku-4-5-20251001
  • Reduce the computational overhead of video diffusion models for real-time use
  • Improve handling of extreme lighting conditions and regions with insufficient illumination
  • Support normal estimation for transparent and semi-transparent objects
  • Extend to scene-centric settings beyond the single-object focus

Author keywords

  • Video Diffusion Model
  • Shading Estimation
  • Single-view Normal Estimation
