ICLR 2026 Orals

High-dimensional Analysis of Synthetic Data Selection

Parham Rezaei, Filip Kovačević, Francesco Locatello, Marco Mondelli

Uncategorized · Thu, Apr 23 · 3:15 PM–3:25 PM · 201 A/B · Avg rating: 5.33 (4–6)
Author-provided TL;DR

We give a precise analysis for the problem of synthetic data selection through the lens of high-dimensional regression, and we translate the theoretical insights into a method that performs well in practice.

Abstract

Despite the progress in the development of generative models, their usefulness in creating synthetic data that improve prediction performance of classifiers has been put into question. Besides heuristic principles such as "synthetic data should be close to the real data distribution", it is actually not clear which specific properties affect the generalization error. Our paper addresses this question through the lens of high-dimensional regression. Theoretically, we show that, for linear models, the *covariance shift* between the target distribution and the distribution of the synthetic data affects the generalization error but, surprisingly, the mean shift does not. Furthermore, in some regimes, we prove that matching the covariance of the target distribution is optimal. Remarkably, the theoretical insights for linear models carry over to deep neural networks and generative models. We empirically demonstrate that the *covariance matching* procedure (matching the covariance of the synthetic data with that of the data coming from the target distribution) performs well against several recent approaches for synthetic data selection, across various training paradigms, datasets and generative models used for augmentation.
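To make the covariance-matching idea concrete, here is a minimal, hypothetical sketch in NumPy: greedily select synthetic samples from a pool so that the empirical covariance of the selection approaches that of the real (target) data. The greedy rule, data shapes, and seeding are illustrative assumptions, not the paper's actual selection algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: real (target) samples, and a larger pool of
# synthetic samples whose per-coordinate scale is mismatched.
d = 5
real = rng.normal(size=(200, d))
pool = rng.normal(size=(1000, d)) * rng.uniform(0.5, 2.0, size=d)

def cov_distance(samples, target_cov):
    """Frobenius distance between the empirical covariance of
    `samples` and `target_cov`."""
    return np.linalg.norm(np.cov(samples, rowvar=False) - target_cov)

def greedy_covariance_match(pool, real, k):
    """Greedily pick k pool indices whose empirical covariance best
    matches that of `real` (one simple instantiation of covariance
    matching; the paper's selection rule may differ)."""
    target_cov = np.cov(real, rowvar=False)
    # Seed with a few samples so the empirical covariance is defined.
    chosen = list(range(2 * pool.shape[1]))
    remaining = set(range(len(pool))) - set(chosen)
    while len(chosen) < k:
        best_i, best_dist = None, np.inf
        for i in remaining:
            dist = cov_distance(pool[chosen + [i]], target_cov)
            if dist < best_dist:
                best_i, best_dist = i, dist
        chosen.append(best_i)
        remaining.remove(best_i)
    return chosen

idx = greedy_covariance_match(pool, real, k=50)
selected = pool[idx]
```

The selected subset can then be used to augment the real training set; in practice one would match covariances of learned features rather than raw inputs, and use a scalable criterion instead of this O(k · |pool|) greedy loop.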

One-sentence summary·Auto-generated by claude-haiku-4-5-20251001(?)

Demonstrates, theoretically and empirically, that matching the covariance of synthetic data to that of the target distribution improves classifier training, outperforming mean matching and other recent selection approaches.

Contributions·Auto-generated by claude-haiku-4-5-20251001(?)
  • Theoretical analysis showing covariance shift, not mean shift, affects generalization error in linear models
  • Proves matching covariance of synthetic data with target distribution is optimal in some regimes
  • Empirical validation of covariance matching across datasets, generative models, and training paradigms
Methods used·Auto-generated by claude-haiku-4-5-20251001(?)
  • High-dimensional regression analysis
  • Covariance matching
  • Generative models
  • Deep learning
Limitations (author-stated)·Auto-generated by claude-haiku-4-5-20251001(?)

Authors did not state explicit limitations.

Future work (author-stated)·Auto-generated by claude-haiku-4-5-20251001(?)
  • Extend the analysis to mixtures of multiple Gaussians
  • Introduce model shift between synthetic and real samples
  • Study uncertainty calibration, differential privacy, fairness, and prediction-powered causal inference

Author keywords

  • high dimensional regression
  • empirical risk minimization
  • synthetic data
  • generative models

Related orals

Information Shapes Koopman Representation

Proposes information-theoretic Lagrangian formulation to balance simplicity and expressiveness in Koopman representation learning for dynamical systems.

Avg rating: 5.50 (4–6) · Xiaoyuan Cheng et al.