WAFT: Warping-Alone Field Transforms for Optical Flow
Yihan Wang, Jia Deng
Abstract
We introduce Warping-Alone Field Transforms (WAFT), a simple and effective method for optical flow. WAFT is similar to RAFT but replaces the cost volume with high-resolution warping, achieving better accuracy at lower memory cost. This design challenges the conventional wisdom that constructing cost volumes is necessary for strong performance. WAFT is a simple and flexible meta-architecture with minimal inductive biases and little reliance on custom designs. WAFT ranks 1st on the Spring, Sintel, and KITTI benchmarks and achieves the best zero-shot generalization on KITTI, while being 1.3-4.1x faster than existing methods of competitive accuracy (e.g., 1.3x faster than FlowFormer++, 4.1x faster than CCMR+). Code and model weights are available at https://github.com/princeton-vl/WAFT.
WAFT replaces cost volumes with high-resolution warping for optical flow, ranking first on Spring, Sintel, and KITTI with 1.3-4.1x faster inference.
- Proposes simple and flexible WAFT meta-architecture replacing cost volumes with high-resolution warping
- Achieves state-of-the-art results on Spring, Sintel, and KITTI benchmarks with best zero-shot generalization
- Demonstrates significantly faster inference speed than existing competitive methods
Key techniques
- High-resolution warping
- Iterative updates
- Feature-space warping
Benchmarks
- Spring
- Sintel
- KITTI
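The core idea above — replacing cost-volume lookups with feature-space warping — can be illustrated with a minimal sketch: sample the second frame's features at positions displaced by the current flow estimate, via bilinear interpolation. This is an assumption-laden illustration of the general technique, not WAFT's actual implementation (function name, array layout, and boundary handling are all choices made here for clarity).

```python
import numpy as np

def warp_features(feat, flow):
    """Bilinearly warp a feature map toward the reference frame.

    feat: (H, W, C) features of the second frame.
    flow: (H, W, 2) current flow estimate, (dx, dy) per pixel.
    Returns (H, W, C) features sampled at pixel + flow, with
    out-of-range coordinates clamped to the image border.
    """
    H, W, _ = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Positions in frame 2 that each reference pixel maps to.
    x = xs + flow[..., 0]
    y = ys + flow[..., 1]
    # Integer corners of the sampling cell, clamped to the border.
    x0 = np.clip(np.floor(x).astype(int), 0, W - 1)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 1)
    x1 = np.clip(x0 + 1, 0, W - 1)
    y1 = np.clip(y0 + 1, 0, H - 1)
    # Fractional offsets become bilinear weights.
    wx = np.clip(x - x0, 0.0, 1.0)[..., None]
    wy = np.clip(y - y0, 0.0, 1.0)[..., None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bot = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

In a RAFT-style iterative scheme, an update block would compare these warped features against the first frame's features and predict a residual flow, repeating for a fixed number of iterations; warping at full feature resolution is what lets this replace a cost volume while keeping memory low.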
Authors did not state explicit limitations.
Authors did not state explicit future directions.
Author keywords
- Optical Flow; Computer Vision; Warping; Dense Correspondences
Related orals
Improving Diffusion Models for Class-imbalanced Training Data via Capacity Manipulation
Capacity manipulation improves diffusion models' handling of class-imbalanced data by reserving capacity for minority classes via low-rank decomposition.
Depth Anything 3: Recovering the Visual Space from Any Views
DA3 predicts spatially consistent 3D geometry from arbitrary camera views using a plain transformer and depth-ray targets.
Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
VIST3A stitches text-to-video models with 3D reconstruction systems and aligns them via reward finetuning for high-quality text-to-3D generation.
Radiometrically Consistent Gaussian Surfels for Inverse Rendering
RadioGS introduces radiometric consistency supervision for inverse rendering to accurately model indirect illumination in Gaussian-based representations.
True Self-Supervised Novel View Synthesis is Transferable
Presents XFactor, first geometry-free self-supervised model for transferable novel view synthesis without 3D inductive biases.