SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie
Abstract
We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720×1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality, long videos with strong text-video alignment at remarkably fast speed, and is deployable on an RTX 5090 GPU. Two core designs enable efficient, effective, long video generation: (1) Linear DiT: we leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-memory KV cache for block linear attention: we design a block-wise autoregressive approach for long video generation that employs a constant-memory state derived from the cumulative property of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves performance competitive with modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, cutting the inference time for a 5-second 720p video from 71s to 29s (a 2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation. Code and model will be publicly released.
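The constant-memory KV cache described above rests on the cumulative form of causal linear attention: each token updates a fixed-size d×d state (an accumulated KᵀV) and a d-dimensional normalizer, so a new block can attend over all prior context without storing past keys and values. A minimal NumPy sketch of this idea, assuming a simple non-negative feature map; function names and the `phi` choice are illustrative, not the paper's implementation:

```python
import numpy as np

def linear_attention_block(q, k, v, state, norm):
    """Causal linear attention over one block of tokens.

    Carries a constant-size state (d x d matrix) and normalizer (d vector)
    across blocks instead of a growing KV cache.
    """
    # Illustrative non-negative feature map (an assumption, not the paper's).
    phi = lambda x: np.maximum(x, 0.0) + 1e-6
    q, k = phi(q), phi(k)
    out = np.empty_like(v)
    for t in range(q.shape[0]):
        state = state + np.outer(k[t], v[t])      # accumulate K^T V
        norm = norm + k[t]                        # accumulate K^T 1
        out[t] = (q[t] @ state) / (q[t] @ norm)   # attend over all context so far
    return out, state, norm
```

Because only `(state, norm)` is carried forward, generating block after block costs fixed memory: feeding two blocks sequentially yields the same output as processing the concatenated sequence in one pass.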
Generates minute-long high-resolution videos efficiently with linear attention and constant-memory KV cache for block autoregression.
- Introduces Linear DiT leveraging linear attention for efficient video generation
- Designs constant-memory KV cache for block linear attention enabling minute-long video generation
- Explores effective data filters and training strategies, achieving a 12-day training cost
- Achieves 16x faster latency than competing small diffusion models with competitive performance
- Linear attention
- Diffusion transformers
- Block autoregression
Authors did not state explicit limitations.
Authors did not state explicit future directions.
Author keywords
- Video Diffusion Model
Related orals
TileLang: Bridge Programmability and Performance in Modern Neural Kernels
TileLang enables hardware-aware fused-kernel programming with tile inference and recommendation, achieving 5-6x speedups.
Probabilistic Kernel Function for Fast Angle Testing
Proposes probabilistic kernel functions for angle testing enabling efficient approximate nearest neighbor search.
Efficient Resource-Constrained Training of Transformers via Subspace Optimization
WASI applies subspace-based training to transformer models reducing memory by 62x and FLOPs by 2x while maintaining accuracy on edge devices.
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
Analyzes low-precision flash attention training failure caused by low-rank representations and biased BF16 rounding errors.
Speculative Actions: A Lossless Framework for Faster AI Agents
Speculative Actions accelerates agent systems by predicting and executing likely future actions in parallel.