A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration

Rohit Jena, Vedant Zope, Pratik Chaudhari, James Gee

LLMs & Reasoning · Sat, Apr 25 · 10:42 AM–10:52 AM · 201 C · Avg rating: 6.50 (2–10)

Author-provided TL;DR

We propose non-GEMM CUDA kernels and distributed primitives to scale multimodal image registration to arbitrary image sizes.

Abstract

In this work, we propose FFDP, a set of IO-aware non-GEMM fused kernels supplemented with a distributed framework for image registration at unprecedented scales. Image registration is an inverse problem fundamental to biomedical and life sciences, but algorithms have not scaled in tandem with image acquisition capabilities. Our framework complements existing model parallelism techniques proposed for large-scale transformer training by optimizing non-GEMM bottlenecks and enabling convolution-aware tensor sharding. We demonstrate these capabilities by performing multimodal registration of a 100μm ex vivo human brain MRI volume at native resolution, an inverse problem more than 570× larger than a standard clinical datum, in about a minute using only 8 A6000 GPUs. FFDP accelerates existing state-of-the-art optimization and deep learning registration pipelines by up to 6–7× while reducing peak memory consumption by 20–59%. Comparative analysis on a 250μm dataset shows that FFDP can fit up to 64× larger problems than existing SOTA on a single GPU, and highlights both the performance and efficiency gains of FFDP compared to SOTA image registration methods.
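The abstract does not spell out how convolution-aware tensor sharding works. As a rough illustration only, the pure-Python sketch below shows one common way to shard a convolution: each shard is extended with a halo of neighbouring samples so that filtering the shards independently reproduces the result on the unsharded signal. This is an assumption about the general idea, not FFDP's implementation; the function names `conv1d_same` and `sharded_conv1d`, the 1D setting, and the zero-padded boundary are all hypothetical choices made for brevity.

```python
def conv1d_same(signal, kernel):
    """Zero-padded 'same' filtering of a 1D signal with an odd-length kernel."""
    radius = len(kernel) // 2
    padded = [0.0] * radius + list(signal) + [0.0] * radius
    return [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(signal))
    ]


def sharded_conv1d(signal, kernel, num_shards):
    """Filter contiguous shards independently after adding a halo to each.

    Hypothetical sketch: on real hardware each shard would live on a
    different device and the halo would be fetched from its neighbours.
    """
    radius = len(kernel) // 2
    n = len(signal)
    shard_len = -(-n // num_shards)  # ceiling division
    out = []
    for start in range(0, n, shard_len):
        stop = min(start + shard_len, n)
        # Halo: `radius` samples from each side, zero-filled past the
        # global boundary to match the zero padding of conv1d_same.
        local = [
            signal[i] if 0 <= i < n else 0.0
            for i in range(start - radius, stop + radius)
        ]
        # 'Valid' filtering of the extended shard yields stop - start outputs.
        out.extend(
            sum(kernel[j] * local[i + j] for j in range(len(kernel)))
            for i in range(stop - start)
        )
    return out


signal = [float((i * i) % 7) for i in range(23)]
kernel = [0.25, 0.5, 0.25]
assert sharded_conv1d(signal, kernel, 4) == conv1d_same(signal, kernel)
```

In this sketch the halo width equals the kernel radius, so the data each shard needs from its neighbours grows with the filter size rather than with the volume size.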

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

FFDP framework scales image registration to 100μm human brain MRI volumes using IO-aware kernels and distributed tensor sharding.

Contributions · Auto-generated by claude-haiku-4-5-20251001

IO-aware non-GEMM fused kernels and distributed framework for large-scale image registration
Convolution-aware tensor sharding complementing model parallelism techniques
Multimodal registration at a scale more than 570× larger than a standard clinical datum

Methods used · Auto-generated by claude-haiku-4-5-20251001

Image registration
Distributed computing
Kernel optimization
Tensor parallelism

Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

image registration
distributed optimization
CUDA kernels
neuroanatomy
