Intrinsic Entropy of Context Length Scaling in LLMs
Jingzhe Shi, Qinwei Ma, Hongyi Liu, Hang Zhao, Jenq-Neng Hwang, Lei Li
We propose to use Intrinsic Entropy to understand the impact of context length on language modeling, and conduct experiments on language and synthetic datasets to validate our theoretical assumptions and deductions.
Abstract
Long Context Language Models have drawn great attention in the past few years. There has been work discussing the impact of long context on Language Model performance: some find that long irrelevant context can harm performance, while others experimentally summarize the loss reduction from relevant long context as Scaling Laws. This calls for a more thorough understanding of how long context impacts Language Modeling. In this work, we (1) propose to use "Intrinsic Entropy" to explain the impact of context length on language modeling; and (2) conduct experiments on natural language and synthetic data, validating our proposed theoretical assumptions and deductions. Our theoretical framework provides practical insights, such as establishing that training dataset size dictates an optimal context length and bounds context length scaling in certain cases. We hope our work may inspire new long context Language Models, as well as future work studying the physics of Language Models.
A theory of context length scaling via Intrinsic Entropy, explaining the relationship between optimal context length and training dataset size.
- Intrinsic Entropy framework for explaining the impact of context length on language modeling
- Theoretical analysis establishing a linear relation between cross-entropy loss and Intrinsic Entropy
- Finding that an optimal context length exists and increases with dataset size in pretraining
- Intrinsic Entropy analysis
- Bayes Risk framework
- Approximation Loss analysis
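The existence of an optimal context length for a fixed dataset size can be illustrated with a toy bias-variance experiment (a minimal sketch with assumed parameters, not the paper's framework or experiments): fitting n-gram models with increasing context length to a finite sample from an order-2 Markov chain, held-out cross-entropy is lowest near the true order, while an overlong context overfits the small training set.

```python
import math
import random
from collections import Counter

def gen_markov2(length, rng):
    # Order-2 Markov chain over alphabet {0,1,2,3}: the next symbol is
    # (a + b) % 4 with probability 0.7, otherwise uniform over the rest.
    seq = [rng.randrange(4), rng.randrange(4)]
    for _ in range(length - 2):
        a, b = seq[-2], seq[-1]
        fav = (a + b) % 4
        if rng.random() < 0.7:
            seq.append(fav)
        else:
            seq.append(rng.choice([s for s in range(4) if s != fav]))
    return seq

def ngram_cross_entropy(train, test, c, alpha=1.0):
    # Fit a context-length-c n-gram model with add-alpha smoothing on
    # `train`; return mean negative log-likelihood (nats/token) on `test`.
    ctx_counts, pair_counts = Counter(), Counter()
    for i in range(c, len(train)):
        ctx = tuple(train[i - c:i])
        ctx_counts[ctx] += 1
        pair_counts[(ctx, train[i])] += 1
    nll, n = 0.0, 0
    for i in range(c, len(test)):
        ctx = tuple(test[i - c:i])
        p = (pair_counts[(ctx, test[i])] + alpha) / (ctx_counts[ctx] + 4 * alpha)
        nll += -math.log(p)
        n += 1
    return nll / n

rng = random.Random(0)
train = gen_markov2(2000, rng)   # small, fixed training set
test = gen_markov2(2000, rng)
losses = {c: ngram_cross_entropy(train, test, c) for c in (0, 2, 8)}
# Context length 2 (the true order) beats both the no-context model and
# the overlong context-8 model, whose contexts are rarely seen in training.
```

With more training data, the context-8 model's estimates improve and the loss gap narrows, which mirrors the listed finding that the optimal context length grows with dataset size.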
Limitations
- The theory starting from Intrinsic Entropy only holds under specific assumptions (from the paper)
- The explanation relies on an Intrinsic Space perspective, which depends on the data, the neural network, and the prediction task (from the paper)
Future work
- Propose more fundamental theories to explain Intrinsic Entropy measurements (from the paper)
Author keywords
- context length
- intrinsic entropy
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing that distribution shifts and model choice affect protection effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.