Difficult Examples Hurt Unsupervised Contrastive Learning: A Theoretical Perspective

Yi-Ge Zhang, Jingyi Cui, Qiran Li, Yisen Wang

Theory & Optimization · Sat, Apr 25 · 10:30 AM–10:40 AM · 203 A/B · Avg rating: 6.00 (6–6)

Author-provided TL;DR

We introduce a similarity-based theoretical framework that shows how difficult boundary examples impair generalization in unsupervised contrastive learning, and we design mechanisms that address this issue and boost downstream accuracy.

Abstract

Unsupervised contrastive learning has shown significant performance improvements in recent years, often approaching or even rivaling supervised learning in various tasks. However, its learning mechanism is fundamentally different from that of supervised learning. Previous work has shown that difficult examples (well recognized in supervised learning as examples around the decision boundary), which are essential in supervised learning, contribute minimally in unsupervised settings. In this paper, perhaps surprisingly, we find that directly removing difficult examples, although it reduces the sample size, can boost the downstream classification performance of contrastive learning. To uncover the reasons behind this, we develop a theoretical framework that models the similarity between different pairs of samples. Guided by this framework, we conduct a thorough theoretical analysis revealing that the presence of difficult examples negatively affects the generalization of contrastive learning. Furthermore, we demonstrate that removing these examples, as well as techniques such as margin tuning and temperature scaling, can improve its generalization bounds, thereby improving performance. Empirically, we propose a simple and efficient mechanism for selecting difficult examples and validate the effectiveness of the aforementioned methods, which substantiates the reliability of our proposed theoretical framework.
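
To make two of the knobs named in the abstract concrete, the following is a minimal sketch of a standard InfoNCE-style contrastive loss with a temperature parameter and an optional mask that drops flagged examples before the loss is computed. It is an illustration under stated assumptions, not the authors' implementation: the function name `info_nce_loss`, the `keep` mask, and the default temperature are hypothetical, and the listing does not describe how the paper selects difficult examples, so the mask is simply supplied by the caller.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce_loss(view_a, view_b, temperature=0.5, keep=None):
    """Mean InfoNCE loss over positive pairs (view_a[i], view_b[i]).

    keep: optional list of booleans (hypothetical interface). Pairs marked
    False are dropped before the loss is computed, standing in for the
    removal of examples flagged as difficult.
    """
    if keep is not None:
        view_a = [v for v, k in zip(view_a, keep) if k]
        view_b = [v for v, k in zip(view_b, keep) if k]
    a = [_normalize(v) for v in view_a]
    b = [_normalize(v) for v in view_b]

    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of the anchor to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in b]
        # Log-sum-exp with max subtraction for numerical stability.
        largest = max(logits)
        log_denominator = largest + math.log(sum(math.exp(l - largest) for l in logits))
        # Negative log-probability of picking the true positive b[i].
        total += log_denominator - logits[i]
    return total / len(a)


if __name__ == "__main__":
    separated = [[1.0, 0.0], [0.0, 1.0]]
    print(info_nce_loss(separated, separated, temperature=0.1))
    print(info_nce_loss(separated, separated, temperature=1.0))
```

In this sketch a lower temperature sharpens the softmax over candidates, so well-separated pairs incur a smaller loss, while the `keep` mask shrinks the batch and with it the set of negatives each anchor is compared against. Margin tuning, also mentioned in the abstract, is not covered here.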

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

EBTs frame System 2 thinking as energy minimization enabling inference-time reasoning emergence across modalities.

Contributions · Auto-generated by claude-haiku-4-5-20251001

Energy-Based Transformers class enabling System 2 thinking through optimization with learned verifier function
Stable and parallelizable training techniques supporting emergence of reasoning capabilities from unsupervised learning
Demonstrate improved generalization and reasoning on out-of-distribution data with 35% faster pre-training scaling

Methods used · Auto-generated by claude-haiku-4-5-20251001

Energy-based models
Optimization-based inference
Transformer architecture
Unsupervised learning

Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

EBTs introduce additional hyperparameters due to optimization process
Scaled only to 800M parameters; larger models unexplored due to resource constraints
Struggle capturing many modes in highly multimodal distributions like images, often combined with autoregression
Lag behind feed-forward Transformers by large margin in FLOP-efficiency posing adoption barrier

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Investigate reducing Reversal Curse through EBT training enabling gradient updates on tokens within context
Improve training stability to enable more optimization steps for longer thinking process
Explore world models combining state and action modeling via joint distribution learning
Use EBTs as complementary System 2 backbone to improve lighter feed-forward models
Investigate recurrent Energy-Based Models leveraging Mamba architecture for latency-driven cases

Author keywords

Machine Learning. Self-Supervised Learning. Difficult Examples

Related orals

On The Surprising Effectiveness of a Single Global Merging in Decentralized Learning

Non-Convex Federated Optimization under Cost-Aware Client Selection

Fast Escape, Slow Convergence: Learning Dynamics of Phase Retrieval under Power-Law Data

A Representer Theorem for Hawkes Processes via Penalized Least Squares Minimization

Quantitative Bounds for Length Generalization in Transformers