InfoNCE Induces Gaussian Distribution
Roy Betser, Eyal Gofer, Meir Yossef Levi, Guy Gilboa
Contrastive-learning-based representations can be well approximated by a multivariate Gaussian distribution.
Abstract
Contrastive learning has become a cornerstone of modern representation learning, allowing training with massive unlabeled data for both task-specific and general (foundation) models. Prototypical losses in contrastive training are InfoNCE and its variants. In this work, we show that the InfoNCE objective induces Gaussian structure in representations that emerge from contrastive training. We establish this result in two complementary regimes. First, we show that under certain alignment and concentration assumptions, projections of the high-dimensional representation asymptotically approach a multivariate Gaussian distribution. Next, under less strict assumptions, we show that adding a small asymptotically vanishing regularization term that promotes low feature norm and high feature entropy leads to similar asymptotic results. We support our analysis with experiments on synthetic and CIFAR-10 datasets across multiple encoder architectures and sizes, demonstrating consistent Gaussian behavior. This perspective provides a principled explanation for commonly observed Gaussianity in contrastive representations. The resulting Gaussian model enables principled analytical treatment of learned representations and is expected to support a wide range of applications in contrastive learning.
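For reference, below is a minimal sketch of the standard InfoNCE (NT-Xent) objective discussed in the abstract, written in PyTorch. The function name, batch construction from two augmented views, and the temperature value are illustrative assumptions, not the paper's exact training setup.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE / NT-Xent loss for a batch of positive pairs.

    z1, z2: (N, d) embeddings of two augmented views of the same N samples.
    Each anchor's positive is its counterpart view; the remaining 2N - 2
    embeddings in the batch act as negatives.
    """
    z1 = F.normalize(z1, dim=1)          # project features onto the unit sphere
    z2 = F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)       # (2N, d)
    sim = z @ z.t() / temperature        # temperature-scaled cosine similarities
    sim.fill_diagonal_(float('-inf'))    # exclude self-similarity
    n = z1.shape[0]
    # positive of sample i is i + n, and vice versa
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)
```

Minimizing this loss pulls the two views of each sample together while spreading embeddings over the unit sphere, which is the setting in which the paper's alignment and concentration assumptions are stated.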
Shows that the InfoNCE loss induces a Gaussian distribution in contrastive representations, providing a principled explanation for commonly observed Gaussianity.
- Proves that InfoNCE-trained representations asymptotically approach a multivariate Gaussian distribution
- Shows that adding a regularization term promoting low feature norm and high feature entropy yields similar asymptotic results
- Validates Gaussian behavior across synthetic data, CIFAR-10, and pretrained models (a simple projection-based check is sketched below)
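As referenced in the last point above, one straightforward way to probe the claimed Gaussianity (not necessarily the paper's own evaluation protocol) is to test 1-D random projections of the learned embeddings for normality. The sketch below assumes a generic embedding matrix and uses NumPy/SciPy.

```python
import numpy as np
from scipy import stats

def projection_gaussianity(embeddings, n_projections=100, seed=0):
    """Test 1-D random projections of learned embeddings for normality.

    embeddings: (N, d) array of representations from a contrastively
    trained encoder. Returns the D'Agostino-Pearson normality p-value
    for each random unit-vector projection; consistently large p-values
    are compatible with approximately Gaussian structure.
    """
    rng = np.random.default_rng(seed)
    X = embeddings - embeddings.mean(axis=0)     # center the features
    d = X.shape[1]
    pvals = []
    for _ in range(n_projections):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)                   # random direction on the sphere
        proj = X @ v
        _, p = stats.normaltest(proj)            # D'Agostino-Pearson K^2 test
        pvals.append(p)
    return np.array(pvals)
```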
- Contrastive learning
- Information theory
- Representation learning
- Gaussian models
- CIFAR-10
- MS-COCO
- ImageNet-R
Results are asymptotic, relying on high-dimensional limits and idealized assumptions.
The analysis does not cover optimization dynamics or prove that training reaches the stated minimizers.
Results characterize population optima under the stated assumptions rather than practical training.
The authors did not state explicit future directions.
Author keywords
- Contrastive learning
- Gaussian distribution
- InfoNCE
Related orals
On The Surprising Effectiveness of a Single Global Merging in Decentralized Learning
Shows that decentralized learning with a single global merging achieves convergence rates matching parallel SGD under data heterogeneity.
Non-Convex Federated Optimization under Cost-Aware Client Selection
Develops an efficient federated optimization algorithm with cost-aware client selection, achieving the best communication and local complexity.
Fast Escape, Slow Convergence: Learning Dynamics of Phase Retrieval under Power-Law Data
Analyzes phase retrieval learning dynamics with anisotropic data, deriving explicit scaling laws and three-phase trajectories.
A Representer Theorem for Hawkes Processes via Penalized Least Squares Minimization
Representer theorem for Hawkes processes shows dual coefficients are analytically fixed to unity via penalized least squares.
Quantitative Bounds for Length Generalization in Transformers
Quantitative bounds show training length required for length generalization depends on periodicity, locality, alphabet size, and model norms.