AnyUp: Universal Feature Upsampling
Thomas Wimmer, Prune Truong, Marie-Julie Rakotosaona, Michael Oechsle, Federico Tombari, Bernt Schiele, Jan Eric Lenssen
A universal feature upsampling model that upsamples any feature from any resolution to any resolution and generalizes to features unseen during training.
Abstract
We introduce AnyUp, a method for feature upsampling that can be applied to any vision feature at any resolution, without encoder-specific training. Existing learning-based upsamplers for features like DINO or CLIP need to be re-trained for every feature extractor and thus do not generalize to different feature types at inference time. In this work, we propose an *inference-time* feature-agnostic upsampling architecture to alleviate this limitation and improve upsampling quality. In our experiments, AnyUp sets a new state of the art for upsampled features, generalizes to different feature types, and preserves feature semantics while being efficient and easy to apply to a wide range of downstream tasks.
AnyUp, an inference-time feature upsampler, generalizes across different feature types and resolutions without encoder-specific retraining.
- First feature-agnostic method for upsampling at inference time to any resolution
- Feature-agnostic layer, windowed attention, and training strategy enabling generalization to unseen feature types
- State-of-the-art upsampling quality while preserving feature semantics across diverse downstream tasks
- feature upsampling
- attention mechanisms
- feature-agnostic architecture
Limitations
- Relies on the simplifying assumption that upsampled features are linear combinations of the low-resolution input features (from the paper)
- Does not extract sub-patch-level spatial information encoded in the high-dimensional feature channels (from the paper)
Future work
- Explore larger, more complex upsampling models to extract additional information from the patch features (from the paper)
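The linear-combination assumption can be illustrated with a minimal cross-attention upsampler: each high-resolution output location attends over the low-resolution patch features and emits a softmax-weighted (hence convex) combination of them. This is a hedged sketch, not the paper's implementation; the function and variable names are illustrative, and AnyUp's actual architecture additionally uses a feature-agnostic layer and windowed attention.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_upsample(queries, keys, lr_feats, temperature=1.0):
    """Sketch of attention-based feature upsampling (illustrative names).

    queries:  (N, d) one embedding per high-resolution output location,
              e.g. derived from the input image at the target resolution
    keys:     (M, d) one embedding per low-resolution patch
    lr_feats: (M, c) the features to upsample; c is arbitrary, so the same
              routine applies to any encoder's features (feature-agnostic)

    Returns (N, c). Softmax weights are non-negative and sum to 1, so every
    output is a convex combination of the input features: this is exactly
    the linear-combination assumption noted in the limitations above.
    """
    d = queries.shape[1]
    logits = queries @ keys.T / (np.sqrt(d) * temperature)  # (N, M)
    weights = softmax(logits, axis=-1)                      # rows sum to 1
    return weights @ lr_feats                               # (N, c)
```

Because the number of output queries N is decoupled from the input grid size M, the same sketch upsamples from any resolution to any resolution; restricting each query to a local window of keys (as the paper's windowed attention does) would reduce the cost from O(N·M) to O(N·k²) for window size k.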
Author keywords
- feature upsampling
- representation learning
Related orals
Improving Diffusion Models for Class-imbalanced Training Data via Capacity Manipulation
Capacity manipulation improves diffusion models' handling of class-imbalanced data by reserving capacity for minority classes via low-rank decomposition.
Depth Anything 3: Recovering the Visual Space from Any Views
DA3 predicts spatially consistent 3D geometry from arbitrary camera views using a plain transformer and depth-ray targets.
Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
VIST3A stitches text-to-video models with 3D reconstruction systems and aligns them via reward finetuning for high-quality text-to-3D generation.
Radiometrically Consistent Gaussian Surfels for Inverse Rendering
RadioGS introduces radiometric consistency supervision for inverse rendering to accurately model indirect illumination in Gaussian-based representations.
True Self-Supervised Novel View Synthesis is Transferable
Presents XFactor, the first geometry-free self-supervised model for transferable novel view synthesis without 3D inductive biases.