4. Quality Evaluation of Synthetic Data
As OpenAI’s Greg Brockman put it: “Evals are surprisingly often all you need.” This applies directly to synthetic data: evaluation determines whether generated data meets real-world needs.
Standardized evaluation typically covers utility (downstream usefulness), fidelity (statistical similarity), and privacy. However, a practical, detailed, and tailored assessment requires a broader view. There is no universal definition of “good” synthetic data: what counts as good depends on the use case (e.g., anomaly detection needs minority/outlier fidelity) and on the domain’s obligations (e.g., government, healthcare, and finance require rigorous privacy testing).
What Quality Dimensions Should You Evaluate?
Here’s a practical set of evaluation dimensions to help you choose strategies that match your specific context:
| Dimension | What’s being evaluated? | Example Approaches |
|---|---|---|
| Fidelity (Similarity) | How closely does synthetic data mimic the real data distribution? | Compare statistics, check correlations, visual comparisons (e.g., Kolmogorov-Smirnov test, correlation matrices) |
| Utility (Usefulness) | How useful is synthetic data for downstream analysis (machine learning, statistical inference, etc.)? | Train models on both datasets, compare performance results (e.g., accuracy, F1-score, AUC) |
| Integrity & Domain Coherence | Does synthetic data preserve format, follow business rules, and make logical domain-specific sense? | Check data formats, validate business rules, expert review (e.g., schema validation, constraint checks) |
| Diversity & Coverage | Does synthetic data capture the full variability of the real dataset, including rare events and outliers? | Check rare cases, measure variety, count unique values (e.g., coverage metrics, outlier detection) |
| Fairness & Bias Mitigation | How does synthetic data handle representation across sensitive attributes or minority groups? | Compare group representation, check for balanced outcomes (e.g., demographic parity, equalized odds) |
| Generalization¹ | Do patterns learned by a synthesizer transfer to unseen real-world data, domains, and time periods? | Test on new data, validate across time periods (e.g., cross-validation, temporal validation) |
| Privacy | What are the risks of exposing sensitive information? | Test for data leakage, check individual privacy protection (e.g., membership inference, reconstruction attacks, attribute inference) |
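As a concrete starting point for the fidelity row above, here is a minimal sketch of a distribution check using per-column Kolmogorov-Smirnov tests and a correlation-matrix comparison. It assumes `real_data` and `synthetic_data` are pandas DataFrames with matching columns (both names are placeholders):

```python
import pandas as pd
from scipy.stats import ks_2samp

def fidelity_summary(real: pd.DataFrame, synthetic: pd.DataFrame) -> None:
    # per-column Kolmogorov-Smirnov test: a small statistic means similar marginals
    for col in real.select_dtypes(include="number").columns:
        stat, p_value = ks_2samp(real[col].dropna(), synthetic[col].dropna())
        print(f"{col}: KS statistic={stat:.3f} (p={p_value:.3g})")

    # largest absolute gap between the two pairwise correlation matrices
    corr_gap = (real.corr(numeric_only=True)
                - synthetic.corr(numeric_only=True)).abs().max().max()
    print(f"max correlation difference: {corr_gap:.3f}")

fidelity_summary(real_data, synthetic_data)
```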

Figure 1: Evaluation Dimensions and Trade-offs in Synthetic Data Generation. The seven evaluation dimensions show directional relationships: solid arrows indicate positive influences (improving the source dimension fosters the target dimension), while dashed arrows indicate conflicting influences (improving the source dimension may compromise the target).
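The utility dimension is often checked with a train-on-synthetic, test-on-real (TSTR) comparison: fit the same model once on real training data and once on synthetic training data, then score both on a held-out real test set. Here is a minimal scikit-learn sketch, assuming a binary classification task with numeric features; `real_train`, `synthetic_train`, `real_test`, and the `label` column are hypothetical names:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def tstr_auc(train_df, test_df, target="label"):
    # fit on the given training set, then evaluate AUC on real test data
    model = RandomForestClassifier(random_state=0)
    model.fit(train_df.drop(columns=[target]), train_df[target])
    scores = model.predict_proba(test_df.drop(columns=[target]))[:, 1]
    return roc_auc_score(test_df[target], scores)

# close AUCs suggest the synthetic data preserves the predictive signal
print("trained on real      ->", tstr_auc(real_train, real_test))
print("trained on synthetic ->", tstr_auc(synthetic_train, real_test))
```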
Because privacy risk evaluation is especially important in public-sector and other high-stakes settings, it gets a dedicated chapter: Privacy-preserving Synthesis.
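Ahead of that chapter, one quick heuristic worth knowing is the distance to closest record (DCR): if synthetic rows sit systematically closer to training records than genuinely unseen holdout rows do, the generator may be memorizing. A rough sketch, assuming numeric, identically scaled DataFrames (`synthetic_data`, `training_data`, and `holdout_data` are placeholders):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dcr(query_df, reference_df):
    # distance from each query row to its single closest reference record
    nn = NearestNeighbors(n_neighbors=1).fit(reference_df.values)
    distances, _ = nn.kneighbors(query_df.values)
    return distances.ravel()

# compare: synthetic rows should not be much closer to the training data
# than holdout rows are
print("synthetic -> train DCR:", np.median(dcr(synthetic_data, training_data)))
print("holdout   -> train DCR:", np.median(dcr(holdout_data, training_data)))
```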
How to Select Metrics for Dimension Evaluation?
Metrics help you measure the quality of synthetic data across different dimensions, which in turn tells you whether it will meet your specific needs. When selecting metrics for evaluation across the dimensions above, make sure they have these characteristics:
- Multidimensional coverage: Assesses multiple quality aspects together, such as fidelity, diversity, and fairness, to spot potential conflicts early (e.g., when improving privacy hurts utility).
- Clear interpretability: Produces results that clearly communicate strengths, weaknesses, and trade-offs (what the metrics can measure, and what they fail to measure) to both technical teams and business stakeholders.
- Granular assessment: Goes beyond overall scores to examine how well synthetic data represents different subgroups and whether it contains unwanted biases or vulnerabilities (see the sketch after this list).
- Reliable consistency: Generates stable, reproducible results by accounting for the randomness in synthetic data generation.
- Practical flexibility: Works effectively even with limited real data for comparison, adapting to your actual constraints and dataset size.
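To make the granular-assessment point concrete, here is a sketch of a per-subgroup check. The `group_col`/`value_col` arguments and the KS-based score are illustrative choices, not a standard metric:

```python
from scipy.stats import ks_2samp

def subgroup_ks(real, synthetic, group_col, value_col):
    # run the marginal check separately for each subgroup, so a good
    # overall score cannot hide a badly modeled minority group
    for group, real_part in real.groupby(group_col):
        synth_part = synthetic[synthetic[group_col] == group]
        if len(synth_part) == 0:
            print(f"{group}: missing from synthetic data")  # coverage gap
            continue
        stat, _ = ks_2samp(real_part[value_col], synth_part[value_col])
        print(f"{group}: KS statistic={stat:.3f} on {value_col}")
```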
Getting Started With Tabular Synthetic Data Evaluation
Several open-source Python packages provide comprehensive evaluation capabilities for getting started:
SDMetrics
Section titled “SDMetrics”Repository: https://github.com/sdv-dev/SDMetrics
```python
from sdmetrics.reports.single_table import QualityReport

# generate quality report
qr = QualityReport()
qr.generate(real_data=real_data, synthetic_data=synthetic_data, metadata=metadata)

# drill into a single property of the report
qr_details = qr.get_details(property_name='Column Shapes')
qr_fig = qr.get_visualization(property_name='Column Shapes')
```
MOSTLY AI Quality Assurance (QA)
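Beyond the per-property details shown above, the report also exposes an aggregate 0-1 quality score via `qr.get_score()`, which is convenient for tracking a generator across iterations; the available property names (such as 'Column Shapes') can vary by SDMetrics version, so check the docs for your installed release.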
Repository: https://github.com/mostly-ai/mostlyai-qa
```python
import webbrowser

from mostlyai import qa

# generate quality report for single-table real and synthetic datasets
report_path, metrics = qa.report(
    syn_tgt_data=synthetic_data,
    trn_tgt_data=training_data,
    hol_tgt_data=holdout_data,  # optional
)

# pretty print metrics
print(metrics.model_dump_json(indent=4))

# open up HTML report in new browser window
webbrowser.open(f"file://{report_path.absolute()}")
```
YData Profiling
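Supplying the optional holdout set gives the report a natural baseline: good synthetic data should resemble the training data roughly as much as the holdout does, whereas synthetic rows systematically closer to training records than the holdout hint at memorization.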
Repository: https://github.com/ydataai/ydata-profiling
```python
from ydata_profiling import ProfileReport

# generate profiles for real and synthetic datasets
real_profile = ProfileReport(real_data, title="Real Data Profile")
synthetic_profile = ProfileReport(synthetic_data, title="Synthetic Data Profile")

# compare profiles
comparison_report = real_profile.compare(synthetic_profile)

# save comparison report
comparison_report.to_file("comparison_report.html")
```
The next chapter shifts focus to one of synthetic data’s most crucial and complex challenges: evaluating the privacy risks of synthetic data.
Footnotes
1. Related to generalization is flexibility, a measure introduced in the article “The fundamental trilemma of synthetic data generation”, which refers to how many use cases the synthetic data can address. Generally speaking, the higher the generalization, the greater the flexibility.