High-dimensional Analysis of Synthetic Data Selection
Parham Rezaei, Filip Kovačević, Francesco Locatello, Marco Mondelli
We give a precise analysis for the problem of synthetic data selection through the lens of high-dimensional regression, and we translate the theoretical insights into a method that performs well in practice.
Abstract
Despite the progress in the development of generative models, their usefulness in creating synthetic data that improve prediction performance of classifiers has been put into question. Besides heuristic principles such as "synthetic data should be close to the real data distribution", it is actually not clear which specific properties affect the generalization error. Our paper addresses this question through the lens of high-dimensional regression. Theoretically, we show that, for linear models, the *covariance shift* between the target distribution and the distribution of the synthetic data affects the generalization error but, surprisingly, the mean shift does not. Furthermore, in some regimes, we prove that matching the covariance of the target distribution is optimal. Remarkably, the theoretical insights for linear models carry over to deep neural networks and generative models. We empirically demonstrate that the *covariance matching* procedure (matching the covariance of the synthetic data with that of the data coming from the target distribution) performs well against several recent approaches for synthetic data selection, across various training paradigms, datasets and generative models used for augmentation.
Demonstrates that selecting synthetic data by matching its covariance to the target distribution, rather than its mean, improves the training of neural networks and outperforms recent synthetic data selection approaches.
- Theoretical analysis showing covariance shift, not mean shift, affects generalization error in linear models
- Proves matching covariance of synthetic data with target distribution is optimal in some regimes
- Empirical validation of covariance matching across datasets, generative models, and training paradigms
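The covariance-matching idea above can be sketched as a simple greedy selection rule: pick synthetic samples so that the empirical covariance of the selected set tracks that of the real data. This is an illustrative sketch only; the function name, the greedy strategy, and the Frobenius-norm criterion are assumptions, not the paper's actual procedure.

```python
import numpy as np

def select_by_covariance_matching(real, synthetic, k):
    """Greedily select k synthetic samples whose empirical covariance
    best matches the real data's covariance (Frobenius-norm criterion).
    Hypothetical sketch; the paper's selection rule may differ."""
    cov_real = np.cov(real, rowvar=False)
    selected, remaining = [], list(range(len(synthetic)))
    for _ in range(k):
        best_i, best_err = None, np.inf
        for i in remaining:
            cand = synthetic[selected + [i]]
            # ddof=0 keeps the covariance estimate defined even for one sample
            cov_cand = np.cov(cand, rowvar=False, ddof=0)
            err = np.linalg.norm(cov_cand - cov_real)
            if err < best_err:
                best_i, best_err = i, err
        selected.append(best_i)
        remaining.remove(best_i)
    return selected
```

The greedy loop is quadratic in the pool size, so it is only meant to convey the criterion; a practical implementation would score samples in batches or use a cheaper surrogate.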
- High-dimensional regression analysis
- Covariance matching
- Generative models
- Deep learning
The authors did not state explicit limitations.
Future directions (from the paper)
- Extend the analysis to mixtures of multiple Gaussians
- Introduce model shift between synthetic and real samples
- Study uncertainty calibration, differential privacy, fairness, and prediction-powered causal inference
Author keywords
- high dimensional regression
- empirical risk minimization
- synthetic data
- generative models