InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression
Haotian Ye, Qiyuan He, Jiaqi Han, Puheng Li, Jiaojiao Fan, Zekun Hao, Fitsum Reda, Yogesh Balaji, Huayu Chen, Sheng Liu, Angela Yao, James Zou, Stefano Ermon, Haoxiang Wang, Ming-Yu Liu
This paper introduces InfoTok, an adaptive video tokenizer guided by information theory, which significantly boosts video compression efficiency and reduces computational overhead without degrading visual quality.
Abstract
Accurate and efficient discrete video tokenization is essential for processing long video sequences. Yet, the inherent complexity and variable information density of videos present a significant bottleneck for current tokenizers, which rigidly compress all content at a fixed rate, leading to redundancy or information loss. Drawing inspiration from Shannon's information theory, this paper introduces InfoTok, a principled framework for adaptive video tokenization. We rigorously prove that existing data-agnostic training methods are suboptimal in representation length, and present a novel evidence lower bound (ELBO)-based algorithm that approaches theoretical optimality. Leveraging this framework, we develop a transformer-based adaptive compressor that enables adaptive tokenization. Empirical results demonstrate state-of-the-art compression performance, saving $20\%$ of tokens without degrading performance, and achieving $2.3\times$ compression rates while still outperforming prior heuristic adaptive approaches. By allocating tokens according to informational richness, InfoTok enables a more compressed yet accurate tokenization for video representation, offering valuable insights for future research.
InfoTok achieves adaptive video tokenization using information-theoretic compression and ELBO-based routing.
- Evidence lower bound-based algorithm for adaptive video tokenization approaching theoretical optimality
- Transformer-based adaptive compressor enabling adaptive token allocation by informational richness
- 20% token saving without performance degradation and 2.3x compression rates
- Framework generalizable beyond video to audio, 3D, and other modalities
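The core idea of allocating a token budget by informational richness can be illustrated with a toy sketch. The snippet below is not the paper's method: it replaces InfoTok's learned ELBO-based router with a simple Shannon-entropy proxy over value histograms, and treats a "video" as a list of flat chunks. All function names and parameters here are illustrative assumptions.

```python
import math
import random

def estimate_bits(chunk):
    """Proxy for information content: Shannon entropy (in bits) of the
    chunk's value histogram, scaled by chunk length. InfoTok instead
    uses a learned ELBO-based estimate; this is only a stand-in."""
    counts = {}
    for v in chunk:
        counts[v] = counts.get(v, 0) + 1
    n = len(chunk)
    return -sum((c / n) * math.log2(c / n) for c in counts.values()) * n

def allocate_tokens(chunks, total_budget, min_tokens=1):
    """Split a total token budget across chunks proportionally to their
    estimated information, rather than compressing all chunks at one
    fixed rate (the suboptimal baseline the paper argues against)."""
    bits = [estimate_bits(c) for c in chunks]
    total_bits = sum(bits) or 1.0  # avoid division by zero
    return [max(min_tokens, round(total_budget * b / total_bits))
            for b in bits]

random.seed(0)
static = [0] * 64                                # low-information chunk
noisy = [random.randrange(256) for _ in range(64)]  # high-information chunk
tokens = allocate_tokens([static, noisy], total_budget=32)
```

Under this proxy the static chunk collapses to the minimum allocation while the noisy chunk receives essentially the whole budget, mirroring the adaptive behavior that lets InfoTok save tokens on redundant content.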
- Information theory
- ELBO-based routing
- Variational auto-encoder
- Transformer-based compression
Limitations
- ELBO-based router introduces modest computational overhead, requiring an additional decoder pass
- Evaluation primarily focused on reconstruction fidelity rather than downstream applications
- Did not extend experiments to video generation or action understanding tasks
Future work
- Explore lighter-weight router mechanisms estimating complexity from encoder representations
- Investigate impact on downstream applications like video generation and video understanding
Author keywords
- discrete tokenization
- video representation
- efficiency
- information theory
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.