Differentially Private Domain Discovery
Vinod Raman, Travis Dick, Matthew Joseph
Abstract
We study several problems in differentially private domain discovery, where each user holds a subset of items from a shared but unknown domain, and the goal is to output an informative subset of items. For set union, we show that the simple baseline Weighted Gaussian Mechanism (WGM) has a near-optimal $\ell_1$ missing mass guarantee on Zipfian data as well as a distribution-free $\ell_\infty$ missing mass guarantee. We then apply the WGM as a domain-discovery precursor for existing known-domain algorithms for private top-$k$ and $k$-hitting set and obtain new utility guarantees for their unknown domain variants. Finally, experiments demonstrate that all of our WGM-based methods are competitive with or outperform existing baselines for all three problems.
WGM-based methods provide efficient domain discovery with near-optimal guarantees for missing mass on Zipfian data.
- Proves the WGM has a near-optimal $\ell_1$ missing mass guarantee on Zipfian data
- Establishes a distribution-free $\ell_\infty$ missing mass guarantee for the WGM
- Applies the WGM as a domain-discovery precursor for private top-$k$ and $k$-hitting set algorithms, yielding new utility guarantees
- Competitive with or outperforms existing baselines across all three domain discovery problems
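The abstract describes the Weighted Gaussian Mechanism as a simple baseline for private set union. A minimal sketch of the standard weighted-histogram form of such a mechanism is below; the function name, the per-user contribution cap `delta_0`, and the equal $\ell_2$ weight split are illustrative assumptions, not the paper's exact construction. Each user spreads bounded weight over their items, Gaussian noise is added to each item's aggregated weight, and only items whose noisy weight clears a threshold are released.

```python
import math
import random
from collections import Counter

def weighted_gaussian_mechanism(user_sets, delta_0, sigma, threshold, seed=0):
    # Hypothetical sketch: parameter names and the equal l2 weight split
    # are illustrative assumptions, not the paper's exact mechanism.
    weights = Counter()
    for items in user_sets:
        kept = sorted(items)[:delta_0]      # cap per-user contribution to bound sensitivity
        if not kept:
            continue
        w = 1.0 / math.sqrt(len(kept))      # equal l2 split across the kept items
        for item in kept:
            weights[item] += w
    rng = random.Random(seed)
    # Release only items whose noisy aggregated weight clears the threshold;
    # items held by no user are never considered, so the domain stays private.
    return {item for item, w in weights.items()
            if w + rng.gauss(0.0, sigma) > threshold}
```

Because only items actually held by users ever enter the histogram, the mechanism never enumerates the unknown domain; privacy comes from the contribution cap, the Gaussian noise, and the threshold together.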
- Weighted Gaussian Mechanism
- Differential privacy
- Domain discovery
- Steam Games
- Amazon Magazine
- Amazon Pantry
The authors did not state explicit limitations.
- Close gaps between upper and lower bounds for the top-$k$ and $k$-hitting set problems
- Extend the data-dependent subsampling strategies of Chen et al. (2025) to missing mass
Author keywords
- Differential Privacy
- Partition Selection
- Top-k Selection
Related orals
LLM Fingerprinting via Semantically Conditioned Watermarks
Introduces semantically conditioned watermarks for stealthy LLM fingerprinting that remains robust across deployment scenarios.
Steering the Herd: A Framework for LLM-based Control of Social Learning
Framework studying strategic control of social learning by algorithmic information mediators with theoretical analysis and LLM-based simulations.
Every Language Model Has a Forgery-Resistant Signature
Ellipse signatures function as forgery-resistant model output identifiers based on high-dimensional geometric constraints.
Gaussian certified unlearning in high dimensions: A hypothesis testing approach
Analyzes machine unlearning in high dimensions, showing a single noisy Newton step with Gaussian noise suffices for the privacy-accuracy trade-off.
What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data
WIMHF uses sparse autoencoders to extract human-interpretable features from preference data, enabling better understanding and curation of human feedback.