
6. Practical Data Synthesis - Quality Guidelines from Tabular Perspectives

High-quality synthetic data requires more than picking a synthesizer and running it on real data. Projects often fail when key steps are skipped or done poorly: requirements are never defined (how the data will be used and what counts as success), the real data is not profiled, an ill-fitting synthesizer is chosen, or evaluation is too narrow. The result is output that looks convincing but lacks utility or carries hidden risks. A systematic, end-to-end process improves results.

This chapter introduces the Synthetic Data Generation (SDG) Pipeline, a practical framework drawn from our experience working on tabular data at GovTech. The guidelines reflect what works in practice. While focused on tabular data, many of the principles extend to other modalities.


Figure 1: The SDG Pipeline. The framework shows eight key stages grouped into three phases: planning and data preparation (yellow), core synthesis (green), and validation and deployment (pink). Dotted arrows indicate the non-linear nature of the pipeline, reflecting real-world SDG projects where common patterns emerge—such as evaluation revealing the need for data preparation refinements, model adjustments, or additional generation cycles.

1. Planning and Requirements

A strong start sets the direction for everything downstream. Before any technical work, define the use case, threat models, success criteria, and roles, and assess how the availability, quality, and scale of the real data, together with the constraints surrounding it, shape what is realistically achievable.

Some key actions:

  • Assess data availability and complexity — confirm data access, scale, and governance rules to anticipate synthesizer and infrastructure needs.
  • Identify stakeholders (policy, legal, domain experts) and capture their needs early for aligned synthesis goals.
  • Surface privacy/fairness/regulatory requirements upfront to avoid re-engineering synthesizers later.
  • Agree on evaluation metrics to guide trade-offs and measure progress (these agreements can be captured in a lightweight spec, as sketched after this list).
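As a concrete illustration, the agreed requirements can be recorded in a simple machine-readable spec that travels with the project. The sketch below is a hypothetical example; every field name and threshold is an assumption, not a standard.

```python
# Hypothetical requirements spec agreed at project kickoff; adapt fields to your context.
requirements = {
    "use_case": "train churn-prediction models without exposing customer records",
    "success_criteria": {
        "max_downstream_auc_gap": 0.05,        # synthetic-trained vs real-trained model
    },
    "privacy": {
        "max_membership_inference_auc": 0.55,  # acceptable attack performance
        "identifiers_allowed": False,
    },
    "compliance": ["internal data-governance review", "legal sign-off before release"],
    "stakeholders": ["data owner", "policy", "legal", "domain expert"],
}
```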

2. Data Preparation (Profiling and Cleaning)


SDG does not fix dirty data—errors, gaps, and biases propagate or worsen. Similar to any AI/ML task, preparation ensures accuracy, consistency, and representativeness in the real data before synthesizer training. Preparation involves actions like column profiling, resolving inaccuracies, handling missing values, and addressing anomalies.

Some column-level data preparation examples (a short pandas sketch follows the table):

| Column Characteristic | Why It Matters | Guideline | Example |
| --- | --- | --- | --- |
| High % missing values | Can produce unrealistic outputs | Analyze missingness mechanism (MCAR, MAR, MNAR); choose appropriate imputation strategy or drop if not feasible | Income: 70% missing → impute with median by job_category or drop column |
| High-cardinality categorical | Increases risk of memorization and slows training | Group rare categories; cap unique values | Job_title: 2000+ values → group rare ones as “Other_Professional” |
| Unique/compound identifiers | No statistical meaning; privacy risk | Remove or pseudonymize (replace with non-identifying codes) | SSN: “123-45-6789” → remove or hash to “ID_A7B9C2” |
| Highly correlated columns | Adds redundancy; can inflate feature importance | Merge or remove after verifying correlation isn’t meaningful | Height_cm & Height_inches (r=1.0) → keep Height_cm only |
| Inconsistent formats & errors | Hides true patterns; inaccurate data propagates to outputs | Standardize formats and units; validate and correct errors | Dates: “Jan-1-2023”, “01/01/23” → standardize to “2023-01-01” |
| Outliers/anomalies | Distorts distributions; can leak rare records | Assess; cap (limit to threshold), remove, or tag | Age: 999 → cap at 95th percentile (e.g., 75) or remove |
| Duplicate rows | Biases learning and wastes compute | Remove; prevent train/test leakage | Customer_ID=123 appears 3x → keep 1, remove 2 duplicates |
| Small categorical groups | Underrepresented in outputs | Merge or oversample (increase representation) | Country: 1-2 records each → group as “Other_Countries” |
| Numerical scaling/skew | Hinders synthesizer learning (for distance/gradient-based methods) | Normalize for neural networks/SVM; transform skewed distributions; skip for tree-based synthesizers | Income: $0-$1M → log-transform or normalize to [0,1] scale |
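To make a few of these guidelines concrete, here is a minimal pandas sketch. The column names (job_title, age, signup_date) and thresholds are hypothetical and only mirror the examples in the table.

```python
import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative column-level cleaning; thresholds are illustrative, not recommendations."""
    df = df.drop_duplicates()  # duplicate rows bias learning

    # High-cardinality categorical: group rare categories under a single label.
    counts = df["job_title"].value_counts()
    rare = counts[counts < 50].index  # hypothetical rarity threshold
    df.loc[df["job_title"].isin(rare), "job_title"] = "Other_Professional"

    # Outliers: cap a numeric column at its 95th percentile.
    df["age"] = df["age"].clip(upper=df["age"].quantile(0.95))

    # Inconsistent formats: standardize dates to ISO 8601 (unparseable values become NaT).
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce").dt.strftime("%Y-%m-%d")
    return df
```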

3. Data Pre-Processing (Synthesizer-Ready Transformations)


Pre-processing transforms clean data from the data preparation step into formats that synthesizers can learn from effectively. Pre-processing involves actions like column encoding, removing derived/deterministic fields, restructuring constraints, and for time series data, setting the sequence length.

Some column-level data pre-processing examples (a short pandas sketch follows the table):

| Category | Why It Matters | Guideline | Example |
| --- | --- | --- | --- |
| Categorical encoding | Encoding choice affects learning and bias | One-hot for nominal; ordinal for ranked; target encoding for high-cardinality | Gender → [0,1] vs Education_level → [1,2,3,4] |
| Derived/deterministic columns (calculated from other columns) | Direct synthesis may break relationships because generation is stochastic | Remove calculated fields; recompute after synthesis or enforce in post-processing | Drop total_price column; keep quantity & unit_price |
| Business constraints and rules | Some values must meet hard rules | Transform the data structure to enforce constraints mathematically; reconstruct during post-processing | start_date < end_date → convert to [start_date, days_duration] where duration > 0 |
| Date representation | Temporal logic must be preserved | Exact dates (demographics): numerical encoding (e.g., year, month); relative dates (sequential): anchor the earliest date, convert the rest to relative offsets | Birth_year → 1985; Transaction_dates → [0, 15, 23, 45] days from first |
| Sequential/temporal features | Synthesizers need structured time-aware inputs | Create lag features or use sequence-aware synthesizers | Previous_month_sales, 7-day_moving_average |
| Special cases (zip codes, IP addresses, etc.) | Require domain-specific handling to preserve accuracy | Apply specialized encoding or use domain-aware synthesizers | Zip_code → [state, region, population_density]; IP → [country, ISP_type] |
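A minimal pandas sketch of the kinds of transformations above, assuming hypothetical columns start_date, end_date, total_price, quantity, unit_price, and gender:

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative synthesizer-ready transformations; column names are hypothetical."""
    df = df.copy()

    # Business constraint: synthesize a positive duration instead of end_date so that
    # start_date < end_date can be reconstructed exactly in post-processing.
    df["days_duration"] = (df["end_date"] - df["start_date"]).dt.days
    df = df.drop(columns=["end_date"])

    # Derived column: drop it now, recompute it after synthesis.
    df = df.drop(columns=["total_price"])  # total_price = quantity * unit_price

    # Categorical encoding: one-hot encode a nominal column.
    df = pd.get_dummies(df, columns=["gender"])
    return df
```

The complementary reconstruction of these columns appears in the post-processing sketch later in this chapter.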

4. Synthesizer Selection and Training

Synthesizer choice has a direct impact on quality and privacy. Consider whether to use open-source, commercial, or in-house solutions, factoring in key considerations like technical expertise, budget, privacy needs, infrastructure, data complexity, and compute capacity. Like any AI/ML modeling, training requires careful setup and monitoring: establish validation metrics, configure compute resources, track training stability, and implement overfitting safeguards (e.g., early stopping, regularization).

Guidelines to consider in this stage (a minimal training sketch follows the list):

  • Choose a synthesizer that fits your data and constraints (refer to Generation Methods for guidance), preferring recent, actively maintained implementations.
  • Decide on your hyperparameter approach (defaults for quick prototyping, manual tuning for optimal results, or automated search for systematic optimization).
  • For large datasets, test settings on a small sample first (e.g., train on 10K rows before scaling to 1M+); this saves compute.
  • Monitor for instability (overfitting to training patterns, mode collapse where outputs become repetitive, constraint drift where rules get violated during training).
  • Keep seeds, configs, and dataset versions for reproducibility (essential for auditing and replicating results).
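The sketch below illustrates the start-small guideline with a library-agnostic interface. `make_synthesizer` is a placeholder for constructing whichever model you choose (many tabular SDG libraries, such as SDV, expose a similar fit/sample style of API); the function name and the 10K threshold are assumptions.

```python
import pandas as pd

SEED = 42  # record seeds, configs, and dataset versions for reproducibility

def train_with_pilot(train_df: pd.DataFrame, make_synthesizer) -> object:
    """Validate settings on a small pilot sample before committing to a full training run."""
    pilot = train_df.sample(n=min(10_000, len(train_df)), random_state=SEED)

    pilot_model = make_synthesizer()
    pilot_model.fit(pilot)  # quick run: check stability, runtime, and obvious quality issues

    full_model = make_synthesizer()
    full_model.fit(train_df)  # scale up only once the pilot run looks healthy
    return full_model
```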

5. Generation

Generation uses the trained synthesizer to produce records that match your target dataset structure. The level of control over what gets generated varies significantly and depends on the synthesizer's implementation.

Guidelines to consider in this stage (a batch-sampling sketch follows the list):

  • Choose a sampling strategy based on dataset size:

    • Single pass for small datasets (under 10K rows)
    • Batches of 1K-5K records for memory efficiency with large datasets
    • Stratified sampling to maintain group proportions
  • Generate more than needed if filtering is likely (e.g., create 12K records if you need 10K final records)

  • Log seeds for reproducibility (e.g., set random_state=42 to ensure identical outputs across runs)

  • Apply constraints during generation if your synthesizer supports it:

    • Specify value ranges (e.g., age 18-65 for adult customer data)
    • Enforce logical rules (e.g., start_date < end_date)
  • Check interim outputs to catch anomalies early before generating the full dataset
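A batch-sampling sketch along these lines, assuming the synthesizer exposes a `sample(num_rows=...)` method (common in tabular SDG libraries, but an assumption here) and a hypothetical age-range rule:

```python
import pandas as pd

def generate(synthesizer, n_target: int, batch_size: int = 5_000, oversample: float = 1.2) -> pd.DataFrame:
    """Generate in batches, oversampling to leave room for filtering, then trim to size."""
    n_raw = int(n_target * oversample)  # generate ~20% extra if filtering is likely
    batches = []
    for start in range(0, n_raw, batch_size):
        n = min(batch_size, n_raw - start)
        batch = synthesizer.sample(num_rows=n)
        # Interim check: enforce a hypothetical range constraint early, before full generation.
        batch = batch[batch["age"].between(18, 65)]
        batches.append(batch)
    return pd.concat(batches, ignore_index=True).head(n_target)
```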

6. Post-Processing and Curation

Post-processing transforms the generated synthetic data so that it meets application requirements and business rules. This includes data decoding (converting encoded formats back to original representations), constraint validation, and format restoration. Curation involves selecting, filtering, and improving the quality of generated samples to create a final dataset that best serves the intended use case.

Guidelines to consider in this stage (a post-processing sketch follows the table):

| Activity | Purpose | Examples |
| --- | --- | --- |
| Enforce business rules | Ensure outputs meet logical constraints not guaranteed during generation | Logical constraints (start_date < end_date), valid ranges (age 0-120, positive monetary values), format validation (valid ZIP codes) |
| Recompute deterministic fields | Restore calculated values that were removed during pre-processing | Calculated values (total_price = quantity × unit_price), derived metrics (BMI from height/weight) |
| Restore application-ready formats | Convert encoded data back to usable formats | Convert encoded dates back to standard formats, transform categorical numbers back to text labels |
| Adjust representation balance | Match real data distributions if needed | Rebalance demographic groups, adjust class proportions, ensure minority representation |
| Curate for quality and relevance | Select and refine samples to maximize dataset value | Filter out low-quality and implausible samples, select diverse representative samples, remove harmful outliers, deduplicate if uniqueness required |
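A post-processing sketch that mirrors the hypothetical pre-processing example earlier in the chapter (reconstructing end_date from days_duration and recomputing total_price):

```python
import pandas as pd

def postprocess(synth: pd.DataFrame) -> pd.DataFrame:
    """Illustrative post-processing; column names follow the earlier hypothetical sketches."""
    synth = synth.copy()

    # Reconstruct the constrained column so that start_date < end_date always holds.
    synth["days_duration"] = synth["days_duration"].clip(lower=1)
    synth["end_date"] = synth["start_date"] + pd.to_timedelta(synth["days_duration"], unit="D")

    # Recompute the deterministic field dropped during pre-processing.
    synth["total_price"] = synth["quantity"] * synth["unit_price"]

    # Enforce valid ranges and curate implausible rows.
    synth = synth[synth["age"].between(0, 120)]
    synth = synth[synth["total_price"] > 0]

    return synth.drop_duplicates()
```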

7. Evaluation

Evaluation confirms whether synthetic data meets fitness-for-purpose requirements and acceptable risk thresholds. This stage validates quality across multiple dimensions before deployment. We covered comprehensive evaluation dimensions and metric-selection guidance in the Quality Evaluation chapter, with approaches for privacy assessment detailed in the Privacy-preserving Synthesis chapter.

Guidelines to consider in this stage (a simple marginal-fidelity check is sketched after the list):

  • Evaluate for any privacy, legal and compliance risks
  • Align metrics with the agreed purpose
  • Ensure robustness by testing across runs, samples, and timeframes (SDG is stochastic in nature)
  • Examine subgroups and rare cases, not just averages
  • Validate against domain knowledge, human-in-the-loop review (visual inspection, medical expert assessment), or external statistics
  • Assess and balance conflicting dimensions (e.g., fairness vs. fidelity, privacy vs. utility)
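As a starting point (not a substitute for the fuller evaluation discussed in the Quality Evaluation chapter), a per-column marginal-fidelity check might look like the sketch below, using the Kolmogorov-Smirnov statistic for numeric columns and total variation distance for categorical ones:

```python
import pandas as pd
from scipy.stats import ks_2samp

def fidelity_report(real: pd.DataFrame, synth: pd.DataFrame) -> pd.DataFrame:
    """Compare marginal distributions column by column (lower values indicate a closer match)."""
    rows = []
    for col in real.columns:
        if pd.api.types.is_numeric_dtype(real[col]):
            stat = ks_2samp(real[col].dropna(), synth[col].dropna()).statistic
            rows.append({"column": col, "metric": "ks_statistic", "value": stat})
        else:
            p = real[col].value_counts(normalize=True)
            q = synth[col].value_counts(normalize=True)
            tvd = 0.5 * p.sub(q, fill_value=0).abs().sum()
            rows.append({"column": col, "metric": "total_variation_distance", "value": tvd})
    return pd.DataFrame(rows)
```

Because generation is stochastic, run such checks across several samples or seeds rather than a single draw.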

8. Deployment

Deployment determines how the synthetic data is accessed, maintained, and trusted over time.

Guidelines to consider in this stage (a release-manifest sketch follows the list):

  • Decide how to publish (synthesizer, dataset, API) based on context and risk
  • Get stakeholder agreement on scope and conditions
  • Provide documentation demonstrating minimal re-identification risk, describing the generation process, and confirming the synthesizer didn’t overfit
  • Version and track datasets, synthesizers, and configs
  • Monitor usage and refresh if data changes or metrics drift
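One lightweight way to satisfy the versioning and documentation points is a release manifest stored alongside the published artifact. The structure below is a hypothetical example; all names and values are placeholders.

```python
# Hypothetical release manifest recorded with each published synthetic dataset.
release_manifest = {
    "dataset_version": "v1.2.0",
    "synthesizer": {"type": "<model name>", "config": "configs/synth_config.yaml"},
    "training": {"seed": 42, "real_data_snapshot": "<snapshot id>"},
    "evaluation": {
        "fidelity_report": "reports/fidelity_v1.2.0.csv",
        "privacy_assessment": "reports/privacy_v1.2.0.pdf",
    },
    "conditions_of_use": "agreed scope and access conditions",
}
```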

This framework works well for most SDG projects that call for careful data preparation and systematic validation. However, the process is rarely strictly linear; you may need to iterate between stages as you refine your approach. Additionally, not every project requires all eight stages: simpler use cases (e.g., generating test data where low utility suffices) might skip certain steps entirely.

Organizations should assess their specific SDG needs and adapt this framework accordingly, selecting the stages and depth of implementation that match their data complexity, privacy requirements, and resource constraints.


Large Language Models (LLMs) offer a different approach to generation that can complement or sometimes simplify the structured pipeline outlined above. The next chapter explores how LLMs are changing the landscape, providing both powerful new capabilities and more intuitive ways to generate content.