Monocular Normal Estimation via Shading Sequence Estimation
Zongrui Li, Xinhua Ma, Minghui Hu, Yunqing Zhao, Yingchen Yu, Qian Zheng, Chang Liu, Xudong Jiang, Song Bai
Abstract
Monocular normal estimation aims to estimate the normal map from a single RGB image of an object under arbitrary lighting. Existing methods rely on deep models to directly predict normal maps. However, they often suffer from 3D misalignment: while the estimated normal maps may look plausible, the reconstructed surfaces often fail to align with the true 3D geometry. We argue that this misalignment stems from the current paradigm: the model struggles to distinguish and estimate the varying geometry represented in normal maps, because differences in the underlying geometry are reflected only through relatively subtle color variations. To address this issue, we propose a new paradigm that reformulates normal estimation as shading sequence estimation, where shading sequences are more sensitive to variations in the underlying geometry. By learning to infer the shading sequence of an object, the model can better capture the underlying 3D geometry and thereby produce more accurate normal predictions. Building on this paradigm, we present RoSE, a method that leverages image-to-video generative models to predict shading sequences, which are then converted into normal maps by solving a simple ordinary least-squares problem. To enhance robustness and better handle complex objects, RoSE is trained on a synthetic dataset, MultiShade, with diverse shapes, materials, and lighting conditions. Experiments demonstrate that RoSE achieves state-of-the-art performance on both synthetic and real-world benchmark datasets for object-based monocular normal estimation.
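The abstract's final conversion step (shading sequence to normal map via ordinary least squares) can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: it assumes a Lambertian shading model, s_k = max(0, l_k · n), with known light directions, in which case the per-pixel normal (up to albedo scale) is the OLS solution of L g = s over the lit observations, normalized to unit length.

```python
import numpy as np

def normal_from_shadings(lights, shadings):
    """lights: (K, 3) unit light directions; shadings: (K,) shading values.

    Solves the ordinary least-squares problem L @ g = s and returns the
    unit normal g / ||g|| (the albedo is absorbed into the scale of g).
    """
    g, *_ = np.linalg.lstsq(lights, shadings, rcond=None)
    return g / np.linalg.norm(g)

# Toy check: synthesize a shading sequence from a known normal, recover it.
n_true = np.array([0.3, -0.2, 1.0])
n_true /= np.linalg.norm(n_true)
L = np.array([[1.0, 0, 1], [-1, 0, 1], [0, 1, 1], [0, -1, 1], [0, 0, 1]])
L /= np.linalg.norm(L, axis=1, keepdims=True)   # unit light directions
s = np.clip(L @ n_true, 0.0, None)              # Lambertian shading, shadows clipped
lit = s > 0                                     # drop shadowed observations
n_hat = normal_from_shadings(L[lit], s[lit])
print(np.round(n_hat, 4))
```

With at least three linearly independent lit light directions, the least-squares system is well determined and the recovered normal matches the ground truth on this noise-free toy example.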
RoSE estimates surface normals via shading sequence prediction, addressing 3D misalignment in monocular normal estimation.
- New paradigm reformulating monocular normal estimation as shading sequence estimation for better geometry capture
- Leverages image-to-video generative models to predict shading sequences converted to normals via OLS solver
- Trains on MultiShade, a synthetic dataset with diverse shapes, materials, and lighting conditions, for robustness
Topics
- Diffusion models
- Image-to-video generation
- Ordinary least squares solving
- Synthetic dataset training
Datasets
- MultiShade
- DiLiGenT
- LUCES
Limitations (from the paper)
- Video diffusion models introduce computational overhead, limiting real-time applicability
- Struggles under extreme lighting conditions with large regions of insufficient illumination
- Fails on transparent or semi-transparent objects
- Primary evaluation is object-centric; scene-centric extension remains open
Future directions (from the paper)
- Reduce the computational overhead of video diffusion models for real-time use
- Improve handling of extreme lighting conditions and regions of insufficient illumination
- Support normal estimation for transparent and semi-transparent objects
- Extend to scene-centric settings beyond the single-object focus
Author keywords
- Video Diffusion Model
- Shading Estimation
- Single-view Normal Estimation