InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression
Haotian Ye, Qiyuan He, Jiaqi Han, Puheng Li, Jiaojiao Fan, Zekun Hao, Fitsum Reda, Yogesh Balaji, Huayu Chen, Sheng Liu, Angela Yao, James Zou, Stefano Ermon, Haoxiang Wang, Ming-Yu Liu
This paper introduces InfoTok, an adaptive video tokenizer guided by information theory, which significantly boosts video compression efficiency and reduces computational overhead without degrading visual quality.
Abstract
Accurate and efficient discrete video tokenization is essential for processing long video sequences. Yet, the inherent complexity and variable information density of videos present a significant bottleneck for current tokenizers, which rigidly compress all content at a fixed rate, leading to redundancy or information loss. Drawing inspiration from Shannon's information theory, this paper introduces InfoTok, a principled framework for adaptive video tokenization. We rigorously prove that existing data-agnostic training methods are suboptimal in representation length, and present a novel evidence lower bound (ELBO)-based algorithm that approaches theoretical optimality. Leveraging this framework, we develop a transformer-based adaptive compressor that enables adaptive tokenization. Empirical results demonstrate state-of-the-art compression performance, saving $20\%$ of tokens without degrading performance, and achieving $2.3\times$ compression rates while still outperforming prior heuristic adaptive approaches. By allocating tokens according to informational richness, InfoTok enables a more compressed yet accurate tokenization for video representation, offering valuable insights for future research.
InfoTok achieves adaptive video tokenization using information-theoretic compression and ELBO-based routing.
- Evidence lower bound-based algorithm for adaptive video tokenization approaching theoretical optimality
- Transformer-based adaptive compressor enabling adaptive token allocation by informational richness
- 20% token saving without performance degradation and 2.3x compression rates
- Framework generalizable beyond video to audio, 3D, and other modalities
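The core idea of allocating a token budget by informational richness can be illustrated with a toy sketch. The snippet below is not the paper's method: it replaces InfoTok's learned ELBO-based router with a simple Shannon-entropy proxy over value histograms, and treats a "video" as a list of flat chunks. All function names and parameters here are illustrative assumptions.

```python
import math
import random

def estimate_bits(chunk):
    """Proxy for information content: Shannon entropy (in bits) of the
    chunk's value histogram, scaled by chunk length. InfoTok instead
    uses a learned ELBO-based estimate; this is only a stand-in."""
    counts = {}
    for v in chunk:
        counts[v] = counts.get(v, 0) + 1
    n = len(chunk)
    return -sum((c / n) * math.log2(c / n) for c in counts.values()) * n

def allocate_tokens(chunks, total_budget, min_tokens=1):
    """Split a total token budget across chunks proportionally to their
    estimated information, rather than compressing all chunks at one
    fixed rate (the suboptimal baseline the paper argues against)."""
    bits = [estimate_bits(c) for c in chunks]
    total_bits = sum(bits) or 1.0  # avoid division by zero
    return [max(min_tokens, round(total_budget * b / total_bits))
            for b in bits]

random.seed(0)
static = [0] * 64                                # low-information chunk
noisy = [random.randrange(256) for _ in range(64)]  # high-information chunk
tokens = allocate_tokens([static, noisy], total_budget=32)
```

Under this proxy the static chunk collapses to the minimum allocation while the noisy chunk receives essentially the whole budget, mirroring the adaptive behavior that lets InfoTok save tokens on redundant content.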
- Information theory
- ELBO-based routing
- Variational auto-encoder
- Transformer-based compression
Limitations
- ELBO-based router introduces modest computational overhead, requiring an additional decoder pass
- Evaluation primarily focused on reconstruction fidelity rather than downstream applications
- Did not extend experiments to video generation or action understanding tasks
Future work
- Explore lighter-weight router mechanisms estimating complexity from encoder representations
- Investigate impact on downstream applications like video generation and video understanding
Author keywords
- discrete tokenization
- video representation
- efficiency
- information theory
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing distribution shifts and model choice impact effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.