5. Privacy-preserving Data Synthesis

While synthetic data attempts to break direct one-to-one mappings between real and synthetic records, its “synthetic” label alone does not guarantee privacy. Like any AI/ML model, a generator—also called a synthesizer—learns patterns from real data, and under certain conditions, those patterns can leak sensitive or identifiable information.

A context-driven privacy evaluation is therefore essential before sharing or using synthetic data in sensitive scenarios. Privacy risks should be assessed using complementary methods, each providing a different lens on potential leakage.

1. Empirical Privacy Risk Assessment: Smoke Testing

Empirical checks are a quick way to flag potential privacy leaks—situations where synthetic data reveals information about real individuals from the real dataset. These tests look for signs that the synthesizer memorized specific real records instead of learning general patterns.

What is a privacy leak? A privacy leak occurs when synthetic data contains information that could identify, or reveal sensitive details about, real people in the real dataset. For example, if a synthetic patient record is nearly identical to a real patient’s record, it could expose that person’s medical information.

Common “smoke tests” include:

Distance-based measures (similarity/proximity checks): Compare synthetic records directly against the real dataset to find close matches. If many synthetic records closely resemble real ones, this suggests the generator copied rather than learned patterns.
Duplicate detection: Check if any synthetic records are exact copies or near-duplicates of real records. Exact matches are clear privacy violations.
Outlier detection: Identify unusual synthetic records (rare combinations of attributes) and check if they match real outliers. Copying rare cases is particularly risky for privacy.
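The duplicate and outlier checks above can be sketched in a few lines. This is a minimal illustration, assuming records are already discretized into hashable tuples; the helper names, the toy records, and the "occurs at most once" definition of a rare record are assumptions made for this example, not part of any specific tool.

```python
from collections import Counter

def exact_duplicates(real, synthetic):
    """Return synthetic records that also appear verbatim in the real data."""
    real_set = set(real)
    return [row for row in synthetic if row in real_set]

def copied_rare_records(real, synthetic, max_count=1):
    """Return synthetic records equal to a real record seen at most max_count times.

    Rarity is approximated here by how often the full record occurs in the
    real data; a real pipeline might use an outlier detector instead.
    """
    counts = Counter(real)
    return [row for row in synthetic if 0 < counts[row] <= max_count]

# Toy records as (age_band, zip_prefix, diagnosis) tuples; all values are made up.
real = [
    ("30-39", "941", "flu"),
    ("30-39", "941", "flu"),
    ("80-89", "100", "rare_condition"),
]
synthetic = [
    ("30-39", "941", "flu"),
    ("80-89", "100", "rare_condition"),
    ("40-49", "606", "asthma"),
]

dupes = exact_duplicates(real, synthetic)
rare_hits = copied_rare_records(real, synthetic)
print(f"{len(dupes)} exact duplicates, rare records copied: {rare_hits}")
```

In this toy example, the copied rare record is the more worrying finding: a common record can be reproduced by chance, while a unique one is more likely to point back to a single individual.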

Figure 1: Example of distance-based privacy evaluation using DCR (distance to closest record) measurements to evaluate the overall distance between real and synthetic data. For every point of synthetic data (blue), the closest point of real data (black) is identified. The DCR (red line) is the distance to that real data point. In this example, there are three columns of data (X, Y and Z); the same DCR metric can be applied to any number of columns.
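The DCR computation described in Figure 1 can be sketched as follows. This is a simplified illustration that assumes all columns are numeric and already scaled to comparable ranges, and that Euclidean distance is an acceptable metric; the function names and the toy X, Y, Z values are made up for this example.

```python
import math

def dcr(synthetic_row, real_rows):
    """Distance from one synthetic record to its closest real record."""
    return min(math.dist(synthetic_row, real_row) for real_row in real_rows)

def dcr_scores(synthetic_rows, real_rows):
    """DCR for every synthetic record (brute force, fine for small data)."""
    return [dcr(row, real_rows) for row in synthetic_rows]

# Toy data with three columns (X, Y, Z).
real = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (5.0, 5.0, 5.0)]
synthetic = [(0.0, 0.0, 1.0), (5.0, 5.0, 5.0), (3.0, 3.0, 3.0)]

scores = dcr_scores(synthetic, real)
exact_copies = sum(1 for s in scores if s == 0.0)
print(scores, exact_copies)
```

A DCR of zero means the synthetic record is an exact copy of a real one. A large DCR only says that no real record is nearby under this particular metric, so it should be read as a smoke-test signal rather than as evidence of privacy.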

Image reference: SDV DCRBaselineProtection

Recent research has shown that DCR consistently fails to identify significant privacy leakage. Synthetic datasets deemed “truly anonymous” or “private” by DCR’s pass/fail statistical tests remained highly vulnerable to severe attacks, including Membership Inference Attacks (MIAs) and reconstruction attacks that recover 78%–100% of training outliers. The continuous DCR score also showed no meaningful correlation with actual vulnerability to MIAs, which makes it a misleading measure of privacy risk. For these reasons, the researchers stress the urgent need to move away from proxy metrics like DCR and to adopt rigorous privacy evaluation standards, such as end-to-end differential privacy.

2. Attack-Based Evaluation: Adversarial Testing

Attack-based evaluations assess privacy risks by testing how vulnerable or safe synthetic data is to privacy attacks. Before interpreting attack results, consider your threat model: what are the attacker’s goals, background knowledge, and available resources?

Common “adversarial tests” include:

Membership inference: Determine whether a target individual’s record was used in training by probing model outputs or analyzing the synthetic release.¹
Attribute inference (sensitive-attribute disclosure): Infer a hidden/sensitive attribute about a target using patterns learned by the synthesizer or models trained on the synthetic data.
Record reconstruction: Recover an approximate training-set record (e.g., a near-duplicate) from model behavior or the synthetic dataset.

Figure 2: Privacy attack goals against synthetic data arranged along an information-gain spectrum: membership inference, attribute inference, and record reconstruction.

Attack-based testing² provides useful risk signals because it works under realistic attacker assumptions, but it must be framed within a defined threat model to avoid over- or underestimating risk.
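To make the idea of an attack under a stated threat model concrete, here is a deliberately naive membership inference sketch. It assumes an attacker who sees only the synthetic release, knows the target's full record, and guesses "member" when some synthetic record lies within a chosen distance; the threshold, the toy records, and the function names are all hypothetical, and real attacks are considerably more sophisticated.

```python
import math

def nearest_distance(target, synthetic_rows):
    """Distance from the target record to its closest synthetic record."""
    return min(math.dist(target, row) for row in synthetic_rows)

def guess_membership(target, synthetic_rows, threshold):
    """Attacker's guess: 'was in training' if a synthetic record is close enough."""
    return nearest_distance(target, synthetic_rows) <= threshold

def attack_accuracy(members, non_members, synthetic_rows, threshold):
    """Fraction of correct guesses over known members and known non-members."""
    correct = sum(guess_membership(t, synthetic_rows, threshold) for t in members)
    correct += sum(not guess_membership(t, synthetic_rows, threshold) for t in non_members)
    return correct / (len(members) + len(non_members))

# Toy evaluation set: records used for training vs. held-out records.
synthetic = [(1.0, 1.0), (4.0, 4.1), (9.0, 9.0)]
members = [(1.0, 1.1), (4.0, 4.0)]
non_members = [(6.0, 6.0), (9.0, 8.9)]

accuracy = attack_accuracy(members, non_members, synthetic, threshold=0.5)
print(accuracy)
```

On a balanced evaluation set, an accuracy near 0.5 means the attacker does no better than guessing. Note that one held-out record here is flagged as a member simply because it resembles a synthetic record, which is why results need a baseline and a threat model to be interpreted.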

Anonymeter is a Python package that helps you test how well your synthetic data protects privacy by simulating different types of attacks.

Guidelines for interpreting privacy test results:

Consider your specific situation: Test results depend on who might attack your data and how they would use it. For example, if you’re sharing data publicly, privacy attacks are more concerning than if you’re only using synthetic data internally within your organization.
High scores don’t always mean high risk: A tool might report that an attack “succeeded,” but this could require the attacker to have unrealistic access to information or computational resources. Consider whether the attack scenario actually applies to your real-world situation.
Dataset specifics: Results vary significantly based on data size, dimensionality, and data quality. For example, if your real dataset is small or has many unique records, privacy tests will often show high vulnerability regardless of how good your synthetic data generator is. This reflects the inherent challenge of protecting privacy in small, diverse datasets.
Baseline comparisons: Always compare against appropriate baselines (e.g., attacks on real data) to interpret results meaningfully. If attacks succeed equally well on real and synthetic data, the issue may be dataset characteristics rather than privacy leakage.
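One simple way to apply the baseline comparison is to report only the attack success that exceeds the baseline, rescaled to the range 0 to 1. The helper below is a hypothetical illustration of that normalization, with made-up success rates; individual tools define their own risk scores, so check their documentation before comparing numbers.

```python
def excess_risk(attack_rate, baseline_rate):
    """Attack success beyond the baseline, rescaled so 1.0 is a perfect attack.

    Both rates are fractions in [0, 1]; the baseline must be below 1.0,
    otherwise there is no headroom left for the attack to improve on it.
    """
    if not 0.0 <= baseline_rate < 1.0:
        raise ValueError("baseline_rate must be in [0, 1)")
    if not 0.0 <= attack_rate <= 1.0:
        raise ValueError("attack_rate must be in [0, 1]")
    return max(0.0, (attack_rate - baseline_rate) / (1.0 - baseline_rate))

# Attack barely beats the baseline: most of the success comes from the data itself.
risk_small = excess_risk(attack_rate=0.62, baseline_rate=0.60)

# Attack clearly beats the baseline: the synthetic release adds real leakage.
risk_large = excess_risk(attack_rate=0.90, baseline_rate=0.50)

print(round(risk_small, 3), round(risk_large, 3))
```

A raw success rate of 62% looks alarming in isolation, but against a 60% baseline the excess is small, which matches the guidance that dataset characteristics can drive attack success.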

Additional tools: SDMetrics and SynthCity (for more technical users) offer empirical and attack-based privacy metrics.

3. Differential Privacy (DP): Formal Guarantees

Differentially private synthesizer training provides a mathematical guarantee that the trained model parameters, and any subsequent model outputs, are relatively unaffected by the addition, removal or change of any single user’s training examples. Unlike empirical or attack-based testing, DP’s protection does not rely on assumptions about attacker behavior.

Figure 3: Synthetic data generation without and with differential privacy (DP). (a) Non-DP: a model learns patterns from real data to sample synthetic records—often high utility but no formal privacy guarantee and potential inference/memorization risks. (b) DP: training uses clipping and calibrated noise (tuned to (ε,δ)), yielding a model and synthetic outputs with a formal individual-level DP guarantee, typically with some utility trade-off.
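The "clipping and calibrated noise" step in panel (b) can be illustrated with a small sketch in the style of DP-SGD. It assumes per-example gradients are available as plain lists of floats; the names `max_norm` and `noise_multiplier` are illustrative. Translating a noise multiplier into an actual (ε,δ) value requires a privacy accountant, which is not shown here.

```python
import math
import random

def clip_to_norm(vector, max_norm):
    """Scale a vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vector]

def noisy_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_to_norm(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(per_example_grads) for x in noisy]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-example gradients

clipped_first = clip_to_norm(grads[0], max_norm=1.0)
no_noise = noisy_mean_gradient(grads, max_norm=1.0, noise_multiplier=0.0, rng=rng)
with_noise = noisy_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng)
print(clipped_first, no_noise, with_noise)
```

Clipping bounds how much any single example can move the update, and the noise is scaled to that bound so that one example's contribution is masked. Setting the noise multiplier to zero, as in `no_noise`, gives no privacy guarantee and is included only to show the effect of clipping.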

Image reference: On Renyi Differential Privacy in Statistics-Based Synthetic Data Generation

The guarantee works by adding carefully calibrated randomness (noise) to the synthesizer’s training process. Its strength is measured by the privacy parameters ε (epsilon) and δ (delta). Most practical deployments use (ε,δ)-DP, where δ accounts for the small probability of privacy failure. Smaller ε values mean stronger privacy but typically lower utility. Intuitively, when ε is small, DP ensures that each person’s data has very little impact on the final synthetic dataset, making it highly uncertain whether their information was included at all.
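For reference, the standard (ε,δ)-DP condition can be written as follows. The symbols M (the randomized training-and-generation mechanism), D and D′ (two datasets that differ in one individual's data), and S (any set of possible outputs) are notation introduced here for illustration.

```latex
\Pr\bigl[\,M(D) \in S\,\bigr] \;\le\; e^{\varepsilon}\,\Pr\bigl[\,M(D') \in S\,\bigr] + \delta
\qquad \text{for all neighboring } D, D' \text{ and all output sets } S.
```

Setting δ = 0 recovers pure ε-DP. As ε shrinks, the output distributions with and without any one individual are forced closer together, which is the formal version of the intuition that an observer cannot tell whether that individual was included.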

When a synthesizer protects privacy using DP while generating data, that process is differentially private synthetic data generation (DP-SDG).
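A very small statistics-based instance of DP-SDG is a noisy histogram: count how often each category occurs, add Laplace noise to each count, and sample synthetic records from the noisy counts. The sketch below assumes a single categorical column, a category list fixed in advance (not derived from the private data), and add-or-remove-one-record neighboring datasets, under which Laplace noise of scale 1/ε on the counts gives ε-DP. Function names are illustrative, and naive floating-point noise sampling like this is not suitable for production use.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_histogram(records, categories, epsilon, rng):
    """Noisy counts per category; negative values are clamped to zero."""
    counts = Counter(records)
    scale = 1.0 / epsilon  # one record changes one count by at most 1
    return {c: max(0.0, counts[c] + laplace_noise(scale, rng)) for c in categories}

def sample_synthetic(noisy_counts, n, rng):
    """Draw n synthetic values in proportion to the noisy counts."""
    categories = list(noisy_counts)
    weights = [noisy_counts[c] for c in categories]
    if sum(weights) == 0:  # every count was clamped to zero: fall back to uniform
        weights = [1.0] * len(categories)
    return rng.choices(categories, weights=weights, k=n)

rng = random.Random(7)
categories = ["A", "B", "C"]            # assumed to be public knowledge
records = ["A"] * 50 + ["B"] * 30 + ["C"] * 20

noisy = dp_histogram(records, categories, epsilon=1.0, rng=rng)
synthetic = sample_synthetic(noisy, n=10, rng=rng)
print(noisy, synthetic)
```

Clamping and sampling only touch the noisy counts, never the real records, so they do not weaken the guarantee. Lowering ε increases the noise scale, which is the privacy and utility trade-off in miniature.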

An Illustrative Example of Using DP for LLM Fine-tuning For Text Generation

Figure 4: Researchers from Microsoft fine-tuned an LLM with DP on a private data corpus. The model can then be used to generate synthetic examples that resemble the private corpus.

Image reference: Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe

1. Israel’s Ministry of Health (2014 Live Births Dataset): In February 2024, Israel’s Ministry of Health released a DP synthetic dataset of its National Registry of singleton live births from 2014. The dataset, designed in collaboration with researchers and stakeholders, is protected by DP with ε = 9.98, balancing usability with privacy safeguards. Read more in the Real Deployments chapter.

2. Microsoft & International Organization for Migration (IOM): In late 2022, the IOM and Microsoft Research partnered to release the Global Victim–Perpetrator DP synthetic dataset, a privacy-preserving dataset capturing cases of human trafficking. It reflects statistical properties of the real sensitive records without exposing identities.

3. UNHCR Registration Data: UNHCR applied DP synthetic data to registration datasets of displaced individuals, enabling secure sharing of sensitive refugee data for humanitarian program planning. The organization found this methodology particularly effective for full enumeration datasets and developed a practical guide for implementation.

4. Google’s On-Device Safety Classifier: In 2024, Google published an approach using DP-SDG to train on-device safety classifiers for language models. By using DP, they aim for synthetic data that reflects the sensitive real data while formally limiting the risk to user privacy.

Synthetic data works best as part of a layered privacy strategy. Rather than relying on generation alone, combine multiple techniques to reduce single points of failure. Some approaches include:

Federated Learning (FL) keeps raw data at its source: training runs locally, and only model parameter updates (here, updates to the synthesizer) are sent to a central server, where they are combined. This reduces exposure risk and lets multiple organizations collaboratively train a synthesizer without directly sharing their datasets.
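The central combination step can be sketched as a weighted average of client parameters, in the style of federated averaging. This assumes each client reports a flat parameter vector together with the number of local examples it trained on; the function name and toy values are made up for this example, and real systems add secure aggregation, client sampling, and multiple rounds.

```python
def federated_average(client_updates):
    """Average client parameter vectors, weighted by local example counts.

    client_updates is a list of (num_examples, parameter_vector) pairs.
    Only these parameters reach the server; raw records stay with clients.
    """
    total_examples = sum(n for n, _ in client_updates)
    dim = len(client_updates[0][1])
    return [
        sum(n * params[i] for n, params in client_updates) / total_examples
        for i in range(dim)
    ]

# Two organizations with different amounts of local data.
updates = [
    (100, [1.0, 2.0]),
    (300, [3.0, 6.0]),
]

global_params = federated_average(updates)
print(global_params)
```

Weighting by example count lets the larger dataset influence the shared synthesizer more. Parameter updates can still leak information about local data, which is one reason FL is typically paired with other safeguards such as DP.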

Privacy Filters reduce risk before or after synthetic data generation (SDG). Before SDG, they suppress rare values (removing uncommon entries), drop PII, generalize identifiers (making specific details less precise), or remove outliers that could create unique signatures in the training data. After SDG, they filter out synthetic records that too closely match real ones, remove unrealistic combinations of attributes, or apply additional anonymization techniques (e.g., k-anonymity grouping, noise addition, or geographic rounding).
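One filter of each kind can be sketched as follows: a pre-synthesis filter that suppresses rare categorical values, and a post-synthesis filter that drops synthetic records lying too close to a real one. The thresholds, the "OTHER" placeholder, and the function names are illustrative assumptions, and the distance filter assumes numeric, comparably scaled columns.

```python
import math
from collections import Counter

def suppress_rare_values(values, min_count, placeholder="OTHER"):
    """Pre-synthesis: replace values seen fewer than min_count times."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else placeholder for v in values]

def drop_close_matches(synthetic_rows, real_rows, min_distance):
    """Post-synthesis: keep only synthetic rows at least min_distance from every real row."""
    return [
        row for row in synthetic_rows
        if min(math.dist(row, real_row) for real_row in real_rows) >= min_distance
    ]

# Pre-filter: the single "astronaut" entry could identify one person.
occupations = ["nurse", "nurse", "astronaut", "teacher", "teacher"]
safe_occupations = suppress_rare_values(occupations, min_count=2)

# Post-filter: the first synthetic row nearly copies a real row.
real = [(1.0, 1.05), (9.0, 9.0)]
synthetic = [(1.0, 1.0), (5.0, 5.0)]
kept = drop_close_matches(synthetic, real, min_distance=0.5)

print(safe_occupations, kept)
```

Note that a post-synthesis filter which consults the real data is itself a data-dependent step, so when the synthesizer carries a DP guarantee, such a filter needs to be accounted for rather than treated as free post-processing.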

Refer to our detailed case study on Google’s Gboard deployment for an example of how synthetic data, FL, DP and privacy filters can be combined to train on-device typing correction models while preserving user privacy.

Privacy isn’t a binary yes-or-no question—it’s about managing risk based on your specific context and threat model. Start with simple smoke tests to catch obvious issues, use attack-based evaluations for realistic threat assessment, and consider DP-SDG when you need formal guarantees with strong privacy protection.

The next chapter shifts focus to exploring practical synthesis considerations by taking a systematic, pipeline-driven approach that helps you generate useful synthetic data for your specific applications.

¹ Membership inference is trickier for data such as images, video and audio. For example, even if a synthetic image is far from a real image under some distance metric, it could still be perceptually similar and leak sensitive identity information. “Perceptual similarity” evaluation is therefore critical for such data.
² In a recent empirical work, the authors systematically performed various attacks on synthetic data with differential privacy guarantees and demonstrated that it can still leak sensitive information. However, we argue that gaps in how differential privacy was applied may have contributed to these failures, and the work lacks details on the implementation.
