Differentially Private Domain Discovery
Vinod Raman, Travis Dick, Matthew Joseph
Abstract
We study several problems in differentially private domain discovery, where each user holds a subset of items from a shared but unknown domain, and the goal is to output an informative subset of items. For set union, we show that the simple baseline Weighted Gaussian Mechanism (WGM) has a near-optimal $\ell_1$ missing mass guarantee on Zipfian data as well as a distribution-free $\ell_\infty$ missing mass guarantee. We then apply the WGM as a domain-discovery precursor for existing known-domain algorithms for private top-$k$ and $k$-hitting set and obtain new utility guarantees for their unknown domain variants. Finally, experiments demonstrate that all of our WGM-based methods are competitive with or outperform existing baselines for all three problems.
WGM-based methods provide efficient domain discovery with near-optimal guarantees for missing mass on Zipfian data.
- Proves the WGM has a near-optimal $\ell_1$ missing mass guarantee on Zipfian data
- Establishes a distribution-free $\ell_\infty$ missing mass guarantee for the WGM
- Applies the WGM as a domain-discovery precursor for private top-$k$ and $k$-hitting set algorithms, yielding new utility guarantees
- Competitive with or outperforms existing baselines across all three domain discovery problems
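The abstract describes the Weighted Gaussian Mechanism as a simple baseline for private set union. A minimal sketch of the standard weighted-histogram form of such a mechanism is below; the function name, the per-user contribution cap `delta_0`, and the equal $\ell_2$ weight split are illustrative assumptions, not the paper's exact construction. Each user spreads bounded weight over their items, Gaussian noise is added to each item's aggregated weight, and only items whose noisy weight clears a threshold are released.

```python
import math
import random
from collections import Counter

def weighted_gaussian_mechanism(user_sets, delta_0, sigma, threshold, seed=0):
    # Hypothetical sketch: parameter names and the equal l2 weight split
    # are illustrative assumptions, not the paper's exact mechanism.
    weights = Counter()
    for items in user_sets:
        kept = sorted(items)[:delta_0]      # cap per-user contribution to bound sensitivity
        if not kept:
            continue
        w = 1.0 / math.sqrt(len(kept))      # equal l2 split across the kept items
        for item in kept:
            weights[item] += w
    rng = random.Random(seed)
    # Release only items whose noisy aggregated weight clears the threshold;
    # items held by no user are never considered, so the domain stays private.
    return {item for item, w in weights.items()
            if w + rng.gauss(0.0, sigma) > threshold}
```

Because only items actually held by users ever enter the histogram, the mechanism never enumerates the unknown domain; privacy comes from the contribution cap, the Gaussian noise, and the threshold together.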
- Weighted Gaussian Mechanism
- Differential privacy
- Domain discovery
- Steam Games
- Amazon Magazine
- Amazon Pantry
The authors did not state explicit limitations.
- Close gaps between upper and lower bounds for the top-$k$ and $k$-hitting set problems
- Extend the data-dependent subsampling strategies of Chen et al. (2025) to missing mass
Author keywords
- Differential Privacy
- Partition Selection
- Top-k Selection
Related orals
LLM Fingerprinting via Semantically Conditioned Watermarks
Introduces semantically conditioned watermarks for stealthy LLM fingerprinting that remains robust across deployment scenarios.
Steering the Herd: A Framework for LLM-based Control of Social Learning
Framework studying strategic control of social learning by algorithmic information mediators with theoretical analysis and LLM-based simulations.
Every Language Model Has a Forgery-Resistant Signature
Ellipse signatures function as forgery-resistant model output identifiers based on high-dimensional geometric constraints.
Gaussian certified unlearning in high dimensions: A hypothesis testing approach
Analyzes machine unlearning in high dimensions, showing a single noisy Newton step with Gaussian noise suffices for the privacy-accuracy trade-off.
What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data
WIMHF uses sparse autoencoders to extract human-interpretable features from preference data, enabling better understanding and curation of human feedback.