High-dimensional Analysis of Synthetic Data Selection
Parham Rezaei, Filip Kovačević, Francesco Locatello, Marco Mondelli
We give a precise analysis for the problem of synthetic data selection through the lens of high-dimensional regression, and we translate the theoretical insights into a method that performs well in practice.
Abstract
Despite the progress in the development of generative models, their usefulness in creating synthetic data that improve prediction performance of classifiers has been put into question. Besides heuristic principles such as "synthetic data should be close to the real data distribution", it is actually not clear which specific properties affect the generalization error. Our paper addresses this question through the lens of high-dimensional regression. Theoretically, we show that, for linear models, the *covariance shift* between the target distribution and the distribution of the synthetic data affects the generalization error but, surprisingly, the mean shift does not. Furthermore, in some regimes, we prove that matching the covariance of the target distribution is optimal. Remarkably, the theoretical insights for linear models carry over to deep neural networks and generative models. We empirically demonstrate that the *covariance matching* procedure (matching the covariance of the synthetic data with that of the data coming from the target distribution) performs well against several recent approaches for synthetic data selection, across various training paradigms, datasets and generative models used for augmentation.
Demonstrates that selecting synthetic data by matching its covariance to the target distribution, rather than its mean, improves the training of neural networks and outperforms recent synthetic data selection approaches.
- Theoretical analysis showing covariance shift, not mean shift, affects generalization error in linear models
- Proves matching covariance of synthetic data with target distribution is optimal in some regimes
- Empirical validation of covariance matching across datasets, generative models, and training paradigms
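The covariance-matching idea above can be sketched as a simple greedy selection rule: pick synthetic samples so that the empirical covariance of the selected set tracks that of the real data. This is an illustrative sketch only; the function name, the greedy strategy, and the Frobenius-norm criterion are assumptions, not the paper's actual procedure.

```python
import numpy as np

def select_by_covariance_matching(real, synthetic, k):
    """Greedily select k synthetic samples whose empirical covariance
    best matches the real data's covariance (Frobenius-norm criterion).
    Hypothetical sketch; the paper's selection rule may differ."""
    cov_real = np.cov(real, rowvar=False)
    selected, remaining = [], list(range(len(synthetic)))
    for _ in range(k):
        best_i, best_err = None, np.inf
        for i in remaining:
            cand = synthetic[selected + [i]]
            # ddof=0 keeps the covariance estimate defined even for one sample
            cov_cand = np.cov(cand, rowvar=False, ddof=0)
            err = np.linalg.norm(cov_cand - cov_real)
            if err < best_err:
                best_i, best_err = i, err
        selected.append(best_i)
        remaining.remove(best_i)
    return selected
```

The greedy loop is quadratic in the pool size, so it is only meant to convey the criterion; a practical implementation would score samples in batches or use a cheaper surrogate.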
- High-dimensional regression analysis
- Covariance matching
- Generative models
- Deep learning
The authors did not state explicit limitations.
Future directions (from the paper)
- Extend the analysis to mixtures of multiple Gaussians
- Introduce model shift between synthetic and real samples
- Study uncertainty calibration, differential privacy, fairness, and prediction-powered causal inference
Author keywords
- high dimensional regression
- empirical risk minimization
- synthetic data
- generative models