ICLR 2026 Orals

A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration

Rohit Jena, Vedant Zope, Pratik Chaudhari, James Gee

LLMs & Reasoning · Sat, Apr 25 · 10:42 AM–10:52 AM · 201 C · Avg rating: 6.50 (2–10)
Author-provided TL;DR

We propose non-GEMM CUDA kernels and distributed primitives to scale multimodal image registration to arbitrary image sizes.

Abstract

In this work, we propose FFDP, a set of IO-aware non-GEMM fused kernels supplemented with a distributed framework for image registration at unprecedented scales. Image registration is an inverse problem fundamental to biomedical and life sciences, but algorithms have not scaled in tandem with image acquisition capabilities. Our framework complements existing model parallelism techniques proposed for large-scale transformer training by optimizing non-GEMM bottlenecks and enabling convolution-aware tensor sharding. We demonstrate unprecedented capabilities by performing multimodal registration of a 100μm ex-vivo human brain MRI volume at native resolution, an inverse problem more than 570× larger than a standard clinical datum, in about a minute using only 8 A6000 GPUs. FFDP accelerates existing state-of-the-art optimization and deep learning registration pipelines by up to 6–7× while reducing peak memory consumption by 20–59%. Comparative analysis on a 250μm dataset shows that FFDP can fit up to 64× larger problems than existing SOTA on a single GPU, and highlights both the performance and efficiency gains of FFDP compared to SOTA image registration methods.
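The abstract does not describe how convolution-aware tensor sharding works. As a rough illustration of the general idea only, the NumPy sketch below splits a volume into slabs along one axis, gives each slab a halo of neighbouring slices so a local convolution sees the same inputs as the global one, and then discards the halo. This is not FFDP's implementation: the function names `conv_axis0` and `sharded_conv_axis0`, the halo scheme, and the odd-length 1D kernel are all assumptions made for the example.

```python
import numpy as np


def conv_axis0(vol, kernel):
    """Zero-padded 'same' correlation of `vol` with a 1D kernel along axis 0.

    Assumes `kernel` has odd length (an assumption of this sketch).
    """
    r = len(kernel) // 2
    padded = np.pad(vol, [(r, r)] + [(0, 0)] * (vol.ndim - 1))
    out = np.zeros(vol.shape, dtype=np.float64)
    for i, w in enumerate(kernel):
        out += w * padded[i:i + vol.shape[0]]
    return out


def sharded_conv_axis0(vol, kernel, num_shards):
    """Hypothetical halo-based sharding: each shard convolves its own slab.

    Every slab is extended by up to r neighbouring slices (the halo) so the
    slices it owns are computed from the same inputs as in the unsharded case.
    In a real multi-GPU setting the halo would be exchanged between devices;
    here it is simply read from the full array.
    """
    r = len(kernel) // 2
    n = vol.shape[0]
    bounds = np.linspace(0, n, num_shards + 1).astype(int)
    pieces = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        h_lo, h_hi = max(lo - r, 0), min(hi + r, n)   # slab plus halo
        local_out = conv_axis0(vol[h_lo:h_hi], kernel)
        start = lo - h_lo                              # skip the leading halo
        pieces.append(local_out[start:start + (hi - lo)])
    return np.concatenate(pieces, axis=0)


rng = np.random.default_rng(0)
volume = rng.standard_normal((16, 4, 4))
smoothing = np.array([1.0, 2.0, 1.0]) / 4.0
full = conv_axis0(volume, smoothing)
sharded = sharded_conv_axis0(volume, smoothing, num_shards=4)
print(np.allclose(full, sharded))
```

The final comparison checks that the sharded result matches the unsharded one, which is the property any convolution-aware sharding scheme has to preserve.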

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

FFDP framework scales image registration to 100μm human brain MRI volumes using IO-aware kernels and distributed tensor sharding.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • IO-aware non-GEMM fused kernels and distributed framework for large-scale image registration
  • Convolution-aware tensor sharding complementing model parallelism techniques
  • Multimodal registration at a scale more than 570× larger than a standard clinical datum
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Image registration
  • Distributed computing
  • Kernel optimization
  • Tensor parallelism
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

  • image registration
  • distributed optimization
  • CUDA kernels
  • neuroanatomy
