TileLang: Bridge Programmability and Performance in Modern Neural Kernels
Lei Wang, Yu Cheng, Yining Shi, Zhiwen Mo, Zhengju Tang, Wenhao Xie, Tong Wu, Lingxiao Ma, Yuqing Xia, Jilong Xue, Fan Yang, Zhi Yang
Abstract
Modern AI algorithms increasingly rely on fused kernels for performance, but implementing them remains complex because existing compilers such as Triton lack fine-grained control. We introduce TileLang, a controllable programming system for fused neural kernels. TileLang provides explicit tile-level primitives for memory placement, data movement, and parallel scheduling. To guide developers in hardware-aware programming, TileLang introduces two key techniques: tile inference, which models tile programs as fused graphs and automatically deduces tile configurations from partial annotations; and tile recommendation, which suggests efficient tile configurations based on hardware profiles and heuristics. TileLang makes it easy to express a wide range of fused attention kernels in under 80 lines of Python code, reducing code size by up to 90% compared to manual implementations. Evaluations show that TileLang achieves up to 5x speedup over Triton on NVIDIA H100 GPUs and up to 6x on AMD GPUs, demonstrating its ability to bridge programmability and performance.
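To make the tile-inference idea concrete, here is a minimal illustrative sketch (not the TileLang API; the function and graph names are hypothetical) of deducing tile configurations by propagating partial annotations through a fused-operator graph:

```python
# Hypothetical sketch of tile inference: propagate user-annotated tile
# shapes through a fused-operator graph until every op is configured.

def infer_tiles(graph, annotations):
    """Deduce a tile shape for every op from partial annotations.

    graph: dict mapping each op name to the list of ops it consumes.
    annotations: dict mapping some ops to user-specified tile shapes.
    """
    tiles = dict(annotations)
    changed = True
    while changed:
        changed = False
        for op, inputs in graph.items():
            if op in tiles:
                continue
            # An unannotated op inherits the tile shape of any input
            # whose shape has already been resolved.
            resolved = [tiles[i] for i in inputs if i in tiles]
            if resolved:
                tiles[op] = resolved[0]
                changed = True
    return tiles

# Fused attention-like chain: matmul -> softmax -> matmul, with only
# the first matmul annotated by the user.
graph = {"mm1": [], "softmax": ["mm1"], "mm2": ["softmax"]}
tiles = infer_tiles(graph, {"mm1": (64, 64)})
```

Here the single annotation on `mm1` is enough to configure the whole fused chain; TileLang's actual inference is described as operating on richer fused-graph models, but the fixed-point propagation shown is the core pattern.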
TileLang enables hardware-aware fused-kernel programming with tile inference and tile recommendation, achieving up to 5-6x speedups over Triton.
- Tile-level programming model with explicit primitives for memory, data movement, and parallel scheduling
- Tile inference that automatically deduces tile configuration from partial annotations via fused graph modeling
- Tile recommendation suggesting efficient configurations from hardware profiles and heuristics
- Graph-based optimization
- Tile-level abstraction
- Hardware profiling
- Configuration inference
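The tile-recommendation technique listed above can be illustrated with a simple, hypothetical heuristic (not TileLang's actual recommender): rank candidate tile shapes by data reuse among those whose operand tiles fit a hardware profile's shared-memory budget.

```python
# Hypothetical sketch of tile recommendation: filter candidate tile
# shapes by a shared-memory budget, then prefer larger tiles, which
# amortize more data reuse per byte moved.

def recommend_tiles(candidates, smem_bytes, elem_bytes=2):
    """Return feasible candidate shapes, largest footprint first."""
    def footprint(shape):
        m, n = shape
        # Assume two input tiles plus one accumulator tile are
        # resident in shared memory at once (a modeling assumption).
        return 3 * m * n * elem_bytes

    feasible = [s for s in candidates if footprint(s) <= smem_bytes]
    return sorted(feasible, key=footprint, reverse=True)

# Assumed H100-like budget of 228 KB of shared memory per SM.
ranked = recommend_tiles([(32, 32), (64, 64), (128, 128), (256, 256)],
                         smem_bytes=228 * 1024)
```

With these assumptions, (256, 256) is rejected as oversized and (128, 128) ranks first; a real recommender would also weigh occupancy, register pressure, and measured profiles, as the paper's heuristics suggest.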
Authors did not state explicit limitations.
Authors did not state explicit future directions.
Author keywords
- compiler; AI; programming model
Related orals
Probabilistic Kernel Function for Fast Angle Testing
Proposes probabilistic kernel functions for angle testing enabling efficient approximate nearest neighbor search.
SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
Generates minute-long high-resolution videos efficiently with linear attention and constant-memory KV cache for block autoregression.
Efficient Resource-Constrained Training of Transformers via Subspace Optimization
WASI applies subspace-based training to transformer models reducing memory by 62x and FLOPs by 2x while maintaining accuracy on edge devices.
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
Analyzes low-precision flash attention training failure caused by low-rank representations and biased BF16 rounding errors.
Speculative Actions: A Lossless Framework for Faster AI Agents
Speculative Actions accelerates agent systems by predicting and executing likely future actions in parallel.