ICLR 2026 Orals

InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression

Haotian Ye, Qiyuan He, Jiaqi Han, Puheng Li, Jiaojiao Fan, Zekun Hao, Fitsum Reda, Yogesh Balaji, Huayu Chen, Sheng Liu, Angela Yao, James Zou, Stefano Ermon, Haoxiang Wang, Ming-Yu Liu

LLMs & Reasoning · Sat, Apr 25 · 10:54 AM–11:04 AM · 204 A/B · Avg rating: 7.33 (range 6–8)
Author-provided TL;DR

This paper introduces InfoTok, an adaptive video tokenizer guided by information theory, which significantly boosts video compression efficiency and reduces computational overhead without degrading visual quality.

Abstract

Accurate and efficient discrete video tokenization is essential for processing long video sequences. Yet the inherent complexity and variable information density of videos present a significant bottleneck for current tokenizers, which rigidly compress all content at a fixed rate, leading to redundancy or information loss. Drawing inspiration from Shannon's information theory, this paper introduces InfoTok, a principled framework for adaptive video tokenization. We rigorously prove that existing data-agnostic training methods are suboptimal in representation length, and present a novel evidence lower bound (ELBO)-based algorithm that approaches theoretical optimality. Leveraging this framework, we develop a transformer-based adaptive compressor that enables adaptive tokenization. Empirical results demonstrate state-of-the-art compression performance: saving 20% of tokens without degrading performance, and achieving 2.3× compression rates while still outperforming prior heuristic adaptive approaches. By allocating tokens according to informational richness, InfoTok enables a more compact yet accurate tokenization for video representation, offering valuable insights for future research.
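The abstract's core idea, allocating tokens in proportion to a clip's informational richness, can be sketched with a toy proxy. This is a hypothetical illustration only, not InfoTok's actual ELBO-based router: `allocate_tokens`, `min_tokens`, and the entropy proxy are all invented for this sketch.

```python
import numpy as np

def allocate_tokens(clips, total_budget, min_tokens=4):
    """Split a fixed token budget across video clips in proportion to a
    crude information-richness proxy: the empirical entropy of each
    clip's quantized pixel intensities. (Hypothetical sketch; InfoTok's
    router is ELBO-based and learned, not a fixed entropy heuristic.)"""
    def entropy(clip):
        # Histogram intensities into 32 bins and compute Shannon entropy.
        hist, _ = np.histogram(clip, bins=32, range=(0.0, 1.0))
        p = hist / hist.sum()
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    scores = np.array([entropy(c) for c in clips])
    # Proportional allocation, with a floor so every clip keeps some tokens.
    raw = scores / scores.sum() * total_budget
    return np.maximum(np.round(raw).astype(int), min_tokens)
```

Under this proxy, a static clip (low entropy) receives far fewer tokens than a texture-rich one, which is the qualitative behavior the paper's adaptive compressor targets.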

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

InfoTok achieves adaptive video tokenization using information-theoretic compression and ELBO-based routing.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Evidence lower bound-based algorithm for adaptive video tokenization approaching theoretical optimality
  • Transformer-based adaptive compressor enabling adaptive token allocation by informational richness
  • 20% token saving without performance degradation and 2.3x compression rates
  • Framework generalizable beyond video to audio, 3D, and other modalities
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Information theory
  • ELBO-based routing
  • Variational autoencoder (VAE)
  • Transformer-based compression
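The ELBO listed above is, in standard variational-autoencoder notation (the paper's exact objective and regularizers may differ), the bound:

```latex
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr] \;-\; \mathrm{KL}\bigl(q_\phi(z \mid x)\,\|\,p(z)\bigr)
```

One plausible reading of "ELBO-based routing" is that such a bound is evaluated across candidate token budgets, keeping the smallest budget whose bound clears a quality threshold; the paper itself specifies the precise mechanism.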
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • The ELBO-based router introduces modest computational overhead, requiring an additional decoder pass
  • Evaluation focuses primarily on reconstruction fidelity rather than downstream applications
  • Experiments do not extend to video generation or action-understanding tasks
Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Explore lighter-weight router mechanisms that estimate complexity directly from encoder representations
  • Investigate the impact on downstream applications such as video generation and video understanding

Author keywords

  • discrete tokenization
  • video representation
  • efficiency
  • information theory
