Quantitative Bounds for Length Generalization in Transformers
Zachary Izzo, Eshaan Nichani, Jason D. Lee
We provide an upper bound on the length of training sequences required for a transformer to generalize to sequences of arbitrary lengths.
Abstract
We study the problem of length generalization (LG) in transformers: the ability of a model trained on shorter sequences to maintain performance when evaluated on much longer, previously unseen inputs. Prior work by Huang et al. (2024) established that transformers eventually achieve length generalization once the training sequence length exceeds some finite threshold, but left open the question of how large it must be. In this work, we provide the first quantitative bounds on the required training length for length generalization to occur. Motivated by previous empirical and theoretical work, we analyze LG in several distinct problem settings: $\ell_\infty$ error control vs. average error control over an input distribution, infinite-precision softmax attention vs. finite-precision attention (which reduces to an argmax) in the transformer, as well as for one- or two-layer transformers. In all scenarios, we prove that LG occurs when the internal behavior of the transformer on longer sequences can be "simulated" by its behavior on shorter sequences seen during training. Our bounds give qualitative estimates for the training sequence length required for a transformer to generalize, and we verify these insights empirically. These results sharpen our theoretical understanding of the mechanisms underlying extrapolation in transformers, and formalize the intuition that richer training data is required for generalization on more complex tasks.
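The abstract distinguishes two error criteria under which LG can be measured. A minimal sketch of the distinction, with illustrative function names not taken from the paper: worst-case ($\ell_\infty$) control bounds the error on every input, while average control only bounds the mean over an input distribution, so a model can pass the average criterion while failing badly on a few inputs.

```python
# Hedged illustration (names are ours, not the paper's): the two error
# criteria from the abstract, for a model f and target g evaluated on a
# finite sample of inputs standing in for the input distribution.
def linf_error(f, g, inputs):
    """Worst-case (l_inf) error of f against g over the inputs."""
    return max(abs(f(x) - g(x)) for x in inputs)

def average_error(f, g, inputs):
    """Average error of f against g over the empirical distribution."""
    return sum(abs(f(x) - g(x)) for x in inputs) / len(inputs)

# Average control is strictly weaker: here f matches g everywhere
# except a single input, so the mean error is tiny but the worst case
# is large.
f = lambda x: 0.0
g = lambda x: 1.0 if x == 0 else 0.0
xs = list(range(100))
assert average_error(f, g, xs) == 0.01
assert linf_error(f, g, xs) == 1.0
```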
Quantitative bounds show training length required for length generalization depends on periodicity, locality, alphabet size, and model norms.
- Provides first quantitative bounds on required training length for length generalization in transformers
- Analyzes distinct problem settings including infinite and finite-precision attention with different error control
- Shows LG occurs when forward pass on longer sequences can be simulated by behavior on shorter training sequences
- Bounds scale with parameter norms, periodicity, locality, alphabet size, and inverse error tolerance
- Transformer analysis
- Attention mechanisms
- Length extrapolation
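The finite-precision setting above reduces attention to an argmax over scores. A toy sketch of why such hard attention can length-generalize (our own illustration, not the paper's construction): an induction-head-style predictor that copies the token following the most recent earlier occurrence of the current token depends only on a bounded local pattern, so behavior learned on short sequences carries over to arbitrarily long ones.

```python
# Toy illustration (not the paper's construction): finite-precision
# attention collapses to an argmax over scores. This hard-attention
# "induction head" copies the token that followed the most recent
# earlier occurrence of the current (last) token.
def hard_attention_predict(seq):
    """Predict the next token of `seq` via argmax attention.

    Matching the last token dominates the score; ties break toward
    the most recent matching position."""
    last = seq[-1]
    best_pos = None
    for i in range(len(seq) - 1):   # candidate key positions
        if seq[i] == last:          # match score dominates
            best_pos = i            # a later match wins the argmax
    if best_pos is None:
        return None                 # no earlier occurrence of `last`
    return seq[best_pos + 1]        # copy the token that followed it

# The prediction depends only on a local pattern, so it is identical
# on a short sequence and a much longer one with the same suffix:
short = list("abcab")
long_ = list("xyz" * 10 + "abcab")
assert hard_attention_predict(short) == hard_attention_predict(long_) == "c"
```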
Authors did not state explicit limitations.
- Extend results to transformers with larger depth (from the paper)
- Relate minimum training length to complexity notions such as C-RASP program length (from the paper)
- Extend average-case analysis to broader classes of distributions over sequences (from the paper)
- Characterize how different positional embedding schemes affect minimum training length (from the paper)
Author keywords
- transformers
- theory
- length generalization
Related orals
On The Surprising Effectiveness of a Single Global Merging in Decentralized Learning
Shows decentralized learning with single global merging achieves convergence rates matching parallel SGD under data heterogeneity.
Non-Convex Federated Optimization under Cost-Aware Client Selection
Develops efficient federated optimization algorithm with cost-aware client selection achieving best communication and local complexity.
Fast Escape, Slow Convergence: Learning Dynamics of Phase Retrieval under Power-Law Data
Analyzes phase retrieval learning dynamics with anisotropic data, deriving explicit scaling laws and three-phase trajectories.
A Representer Theorem for Hawkes Processes via Penalized Least Squares Minimization
Representer theorem for Hawkes processes shows dual coefficients are analytically fixed to unity via penalized least squares.
Difficult Examples Hurt Unsupervised Contrastive Learning: A Theoretical Perspective