ICLR 2026 Orals

InfoNCE Induces Gaussian Distribution

Roy Betser, Eyal Gofer, Meir Yossef Levi, Guy Gilboa

Theory & Optimization · Sat, Apr 25 · 10:54 AM–11:04 AM · 203 A/B · Avg rating: 4.00 (2–8)
Author-provided TL;DR

Contrastive-learning-based representations can be well approximated by a multivariate Gaussian distribution.

Abstract

Contrastive learning has become a cornerstone of modern representation learning, allowing training with massive unlabeled data for both task-specific and general (foundation) models. A prototypical loss in contrastive training is InfoNCE and its variants. In this work, we show that the InfoNCE objective induces Gaussian structure in representations that emerge from contrastive training. We establish this result in two complementary regimes. First, we show that under certain alignment and concentration assumptions, projections of the high-dimensional representation asymptotically approach a multivariate Gaussian distribution. Next, under less strict assumptions, we show that adding a small asymptotically vanishing regularization term that promotes low feature norm and high feature entropy leads to similar asymptotic results. We support our analysis with experiments on synthetic and CIFAR-10 datasets across multiple encoder architectures and sizes, demonstrating consistent Gaussian behavior. This perspective provides a principled explanation for commonly observed Gaussianity in contrastive representations. The resulting Gaussian model enables principled analytical treatment of learned representations and is expected to support a wide range of applications in contrastive learning.
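For concreteness, below is a minimal PyTorch sketch of the InfoNCE objective the abstract refers to, together with an illustrative version of the vanishing regularizer it describes (low feature norm, high feature entropy). The function names, the temperature, the weight lam, and the Gaussian log-determinant entropy proxy are assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Standard InfoNCE over a batch of positive pairs (z1[i], z2[i]);
    every other in-batch sample serves as a negative."""
    z1 = F.normalize(z1, dim=1)                  # unit-norm features (cosine similarity)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature           # (N, N) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)  # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

def regularized_info_nce(z1, z2, temperature=0.1, lam=1e-3):
    """InfoNCE plus a small term promoting low feature norm and high feature
    entropy, in the spirit of the abstract's second regime. The weight lam and
    the log-det entropy proxy are illustrative assumptions, not the paper's form."""
    norm_term = z1.pow(2).sum(dim=1).mean() + z2.pow(2).sum(dim=1).mean()
    z = torch.cat([z1, z2], dim=0)
    cov = torch.cov(z.t()) + 1e-4 * torch.eye(z.size(1), device=z.device)
    entropy_proxy = 0.5 * torch.logdet(cov)      # Gaussian differential entropy, up to constants
    return info_nce(z1, z2, temperature) + lam * (norm_term - entropy_proxy)
```

Here z1 and z2 would be encoder outputs for two augmented views of the same batch; per the abstract, the regularization weight is taken to vanish asymptotically.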

One-sentence summary·Auto-generated by claude-haiku-4-5-20251001

Shows that the InfoNCE loss induces a Gaussian distribution in contrastive representations, providing a principled explanation for observed Gaussianity.

Contributions·Auto-generated by claude-haiku-4-5-20251001
  • Proves that InfoNCE-trained representations asymptotically approach a multivariate Gaussian distribution
  • Shows adding regularization promoting low norm and high entropy yields similar asymptotic results
  • Validates Gaussian behavior across synthetic, CIFAR-10, and pretrained models
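As a concrete illustration of the validation in the last bullet, a Gaussianity probe might project learned representations onto random unit directions and test each 1-D projection for normality. This is a hypothetical sketch, not the paper's evaluation protocol; the function name and the choice of the D'Agostino-Pearson test are assumptions.

```python
import numpy as np
from scipy import stats

def projection_normality_pvalues(features, n_directions=100, seed=0):
    """Project (n_samples, dim) features onto random unit directions and
    return a normality-test p-value for each 1-D projection."""
    rng = np.random.default_rng(seed)
    d = features.shape[1]
    pvals = []
    for _ in range(n_directions):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)                   # random direction on the unit sphere
        pvals.append(stats.normaltest(features @ v).pvalue)
    return np.array(pvals)
```

If the representations are close to multivariate Gaussian, p-values from such projections should not concentrate near zero.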
Methods used·Auto-generated by claude-haiku-4-5-20251001
  • Contrastive learning
  • Information theory
  • Representation learning
  • Gaussian models
Datasets used·Auto-generated by claude-haiku-4-5-20251001
  • CIFAR-10
  • MS-COCO
  • ImageNet-R
Limitations (author-stated)·Auto-generated by claude-haiku-4-5-20251001
  • Results are asymptotic, relying on high-dimensional limits and idealized assumptions
  • Analysis does not include optimization dynamics or a proof that training reaches the stated minimizers
  • Results characterize population optima under stated assumptions rather than practical training
Future work (author-stated)·Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit future directions.

Author keywords

  • Contrastive learning
  • Gaussian distribution
  • InfoNCE
