TileLang: Bridge Programmability and Performance in Modern Neural Kernels
Lei Wang, Yu Cheng, Yining Shi, Zhiwen Mo, Zhengju Tang, Wenhao Xie, Tong Wu, Lingxiao Ma, Yuqing Xia, Jilong Xue, Fan Yang, Zhi Yang
Abstract
Modern AI algorithms increasingly rely on fused kernels for performance, but implementing them remains complex because existing compilers such as Triton lack fine-grained control. We introduce TileLang, a controllable programming system for fused neural kernels. TileLang provides explicit tile-level primitives for memory placement, data movement, and parallel scheduling. To guide developers in hardware-aware programming, TileLang introduces two key techniques: tile inference, which models tile programs as fused graphs and automatically deduces tile configurations from partial annotations; and tile recommendation, which suggests efficient tile configurations based on hardware profiles and heuristics. TileLang makes it easy to express a wide range of fused attention kernels in under 80 lines of Python code, reducing code size by up to 90% compared to manual implementations. Evaluations show that TileLang achieves up to 5x speedup over Triton on NVIDIA H100 GPUs and up to 6x on AMD GPUs, demonstrating its ability to bridge programmability and performance.
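To make the tile-inference idea concrete, here is a minimal illustrative sketch (not the TileLang API; the function and graph names are hypothetical) of deducing tile configurations by propagating partial annotations through a fused-operator graph:

```python
# Hypothetical sketch of tile inference: propagate user-annotated tile
# shapes through a fused-operator graph until every op is configured.

def infer_tiles(graph, annotations):
    """Deduce a tile shape for every op from partial annotations.

    graph: dict mapping each op name to the list of ops it consumes.
    annotations: dict mapping some ops to user-specified tile shapes.
    """
    tiles = dict(annotations)
    changed = True
    while changed:
        changed = False
        for op, inputs in graph.items():
            if op in tiles:
                continue
            # An unannotated op inherits the tile shape of any input
            # whose shape has already been resolved.
            resolved = [tiles[i] for i in inputs if i in tiles]
            if resolved:
                tiles[op] = resolved[0]
                changed = True
    return tiles

# Fused attention-like chain: matmul -> softmax -> matmul, with only
# the first matmul annotated by the user.
graph = {"mm1": [], "softmax": ["mm1"], "mm2": ["softmax"]}
tiles = infer_tiles(graph, {"mm1": (64, 64)})
```

Here the single annotation on `mm1` is enough to configure the whole fused chain; TileLang's actual inference is described as operating on richer fused-graph models, but the fixed-point propagation shown is the core pattern.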
TileLang enables hardware-aware fused-kernel programming with tile inference and tile recommendation, achieving up to 5-6x speedups over Triton.
- Tile-level programming model with explicit primitives for memory, data movement, and parallel scheduling
- Tile inference that automatically deduces tile configuration from partial annotations via fused graph modeling
- Tile recommendation suggesting efficient configurations from hardware profiles and heuristics
- Graph-based optimization
- Tile-level abstraction
- Hardware profiling
- Configuration inference
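The tile-recommendation technique listed above can be illustrated with a simple, hypothetical heuristic (not TileLang's actual recommender): rank candidate tile shapes by data reuse among those whose operand tiles fit a hardware profile's shared-memory budget.

```python
# Hypothetical sketch of tile recommendation: filter candidate tile
# shapes by a shared-memory budget, then prefer larger tiles, which
# amortize more data reuse per byte moved.

def recommend_tiles(candidates, smem_bytes, elem_bytes=2):
    """Return feasible candidate shapes, largest footprint first."""
    def footprint(shape):
        m, n = shape
        # Assume two input tiles plus one accumulator tile are
        # resident in shared memory at once (a modeling assumption).
        return 3 * m * n * elem_bytes

    feasible = [s for s in candidates if footprint(s) <= smem_bytes]
    return sorted(feasible, key=footprint, reverse=True)

# Assumed H100-like budget of 228 KB of shared memory per SM.
ranked = recommend_tiles([(32, 32), (64, 64), (128, 128), (256, 256)],
                         smem_bytes=228 * 1024)
```

With these assumptions, (256, 256) is rejected as oversized and (128, 128) ranks first; a real recommender would also weigh occupancy, register pressure, and measured profiles, as the paper's heuristics suggest.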
Authors did not state explicit limitations.
Authors did not state explicit future directions.
Author keywords
- compiler; AI; programming model
Related orals
Probabilistic Kernel Function for Fast Angle Testing
Proposes probabilistic kernel functions for angle testing enabling efficient approximate nearest neighbor search.
SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
Generates minute-long high-resolution videos efficiently with linear attention and constant-memory KV cache for block autoregression.
Efficient Resource-Constrained Training of Transformers via Subspace Optimization
WASI applies subspace-based training to transformer models reducing memory by 62x and FLOPs by 2x while maintaining accuracy on edge devices.
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
Analyzes low-precision flash attention training failure caused by low-rank representations and biased BF16 rounding errors.
Speculative Actions: A Lossless Framework for Faster AI Agents
Speculative Actions accelerates agent systems by predicting and executing likely future actions in parallel.