Glossary of Terms
Adapter Methods
Small trainable modules inserted into a frozen pre-trained model (e.g., between transformer layers) to specialize it for a new domain or task. They enable fast, low-cost fine-tuning because only the adapters are updated, while the original weights stay intact.
Artificial Intelligence (AI)
Computer systems that perform tasks requiring human-like capabilities such as pattern recognition, reasoning, and learning. In synthetic data, AI models learn the statistical structure of real datasets and then generate new samples that follow the same patterns without copying specific records.
Auxiliary Models
Specialized helper models that improve the outputs of a primary model. In synthetic data pipelines they can auto-label, filter low-quality samples, de-duplicate, or enforce constraints—raising quality at scale without constant human review.
Correlations
Dependencies between variables (e.g., how blood pressure relates to age). High-quality synthetic data preserves key correlations—both simple (pairwise) and, when needed, higher-order—so analyses and downstream models behave similarly to those trained on real data.
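For illustration, a minimal sketch (assuming two pandas DataFrames with matching numeric columns) of quantifying how well synthetic data preserves pairwise correlations:

```python
import pandas as pd

def correlation_gap(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Mean absolute difference between the two correlation matrices."""
    real_corr = real.corr(numeric_only=True)
    syn_corr = synthetic.corr(numeric_only=True)
    return (real_corr - syn_corr).abs().mean().mean()  # 0.0 = identical correlations
```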
Data Contamination
Leakage of evaluation data (or near-duplicates) into the training set, causing misleadingly high scores. For LLMs and generators, contamination can make models appear more capable than they are; rigorous deduplication and clean held-out sets are the antidote.
Differential Privacy
A formal privacy framework that bounds how much any one person’s data can influence a result. Practically, it injects calibrated randomness during training or querying so outputs are statistically similar whether or not an individual is included, quantified by a privacy budget (ε).
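For intuition, a minimal sketch of the Laplace mechanism, one standard way to answer a query with differential privacy (the count query and ε value are illustrative):

```python
import numpy as np

def private_count(records, epsilon: float) -> float:
    # A count changes by at most 1 when one person is added or removed,
    # so sensitivity = 1; noise scales with sensitivity / epsilon.
    sensitivity = 1.0
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return len(records) + noise
```

Smaller ε means more noise and stronger privacy; larger ε means less noise and weaker guarantees.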
Diffusion Models
Generative models that learn to turn noise into data by reversing a gradual noising process. Trained to denoise step-by-step, they can produce high-quality images, audio, video, and even tabular samples by starting from pure noise at inference.
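As a toy sketch of the forward (noising) side, using the standard closed-form jump to step t (the schedule value `alpha_bar_t` is assumed given):

```python
import numpy as np

def forward_noise(x0: np.ndarray, alpha_bar_t: float) -> np.ndarray:
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = np.random.randn(*x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise
```

Training teaches a network to predict and remove that noise; generation runs the removal steps starting from pure noise.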
Distributions
Descriptions of how values are spread (e.g., means, variances, tails). Good synthesizers match important parts of the real data distributions—including rare but meaningful regions—so metrics and models derived from synthetic data remain trustworthy.
Foundation Models
Large pre-trained models (often transformer-based) trained on broad, diverse data and adaptable to many tasks with minimal additional training. In synthetic data, they can be prompted or lightly tuned to generate realistic text, code, or multimodal samples.
Federated Learning
A training setup where many devices or institutions collaboratively train a shared model without sharing raw data. Only model updates are exchanged, reducing central exposure while still learning from distributed data.
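A minimal sketch of the server's aggregation step, in the style of FedAvg (client weights are flattened numpy arrays here for illustration):

```python
import numpy as np

def federated_average(client_weights: list[np.ndarray],
                      client_sizes: list[int]) -> np.ndarray:
    """Average client models, weighted by each client's local dataset size."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))
```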
Few-Shot Learning
Adapting to a new task from a handful of labeled examples. Useful when real data are scarce; few-shot prompts or small fine-tuning runs guide a broadly trained model to the target domain.
Fine-Tuning
Continuing training of a pre-trained model on task-specific data to specialize its behavior. In synthetic data contexts, fine-tuning helps match domain vocabulary, formats, constraints, and quality targets with far less compute than training from scratch.
Generative Adversarial Networks
Two-network systems where a generator proposes samples and a discriminator tries to detect fakes. Through competition, the generator learns to produce data that follow the real distribution; careful training avoids mode collapse and preserves diversity.
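A minimal single-step training sketch in PyTorch (assuming `gen` and `disc` are `nn.Module`s, the discriminator emits one logit per sample, and the optimizers are already built):

```python
import torch
import torch.nn.functional as F

def gan_step(gen, disc, g_opt, d_opt, real_batch, z_dim=64):
    batch = real_batch.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)
    fake = gen(torch.randn(batch, z_dim))

    # Discriminator: push real toward 1, fake toward 0.
    d_opt.zero_grad()
    d_loss = (F.binary_cross_entropy_with_logits(disc(real_batch), ones)
              + F.binary_cross_entropy_with_logits(disc(fake.detach()), zeros))
    d_loss.backward()
    d_opt.step()

    # Generator: fool the discriminator into outputting 1 on fakes.
    g_opt.zero_grad()
    g_loss = F.binary_cross_entropy_with_logits(disc(fake), ones)
    g_loss.backward()
    g_opt.step()
```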
Generative Models
Models that learn a probability distribution from data and then sample from it to create new instances (text, images, tables). Unlike discriminative models that classify, generative models produce plausible new content consistent with the training patterns.
Hallucination
When a model outputs confident but unsupported or false content. Common in language generation; mitigation includes retrieval augmentation, better prompts, calibration, and guardrails.
Held-Out Dataset
A subset of real data set aside before training and never touched during modeling or synthesis. It provides an unbiased check on utility and overfitting (e.g., evaluating Train-on-Synthetic, Test-on-Real (TSTR) performance or monitoring distributional drift).
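A minimal TSTR sketch with scikit-learn (the logistic-regression model and the feature/label arrays are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def tstr_score(X_synthetic, y_synthetic, X_held_out, y_held_out) -> float:
    """Train on synthetic data, then evaluate on the real held-out set."""
    model = LogisticRegression(max_iter=1000).fit(X_synthetic, y_synthetic)
    return accuracy_score(y_held_out, model.predict(X_held_out))
```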
Hyperparameters
Settings chosen before training (learning rate, batch size, architecture depth, noise schedule, DP ε, etc.). They control training dynamics and the realism/privacy trade-offs of a synthesizer and must be tuned deliberately.
Jailbreak Prompts
Inputs crafted to bypass safety policies and elicit restricted or harmful outputs. They’re used in red-teaming to probe weaknesses; robust prompt handling, content filters, and response grounding help defend against them.
K-Anonymity
A property where each record is indistinguishable from at least k-1 others on quasi-identifiers (e.g., {age, ZIP, gender}). Achieved via generalization or suppression, it reduces re-identification risk but doesn’t prevent attribute inference on homogeneous groups.
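A minimal check with pandas (column names are illustrative): a table is k-anonymous on the chosen quasi-identifiers if every combination of their values appears at least k times.

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list[str], k: int) -> bool:
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Example: is_k_anonymous(df, ["age", "zip", "gender"], k=5)
```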
Large Language Models (LLMs)
Transformer-based models trained on massive text corpora that can follow instructions and generate fluent text. For synthetic data, LLMs can create labeled examples, simulate rare cases, or produce structured records when guided by schemas and constraints.
Learning-Based Synthesis
Synthetic data generation that trains models on real data to learn its structure and then samples new records. It contrasts with rule-based scripts or physics simulators and includes methods like VAEs, GANs, diffusion, copulas, and transformers.
LLM-as-a-Judge
Using a (usually separate) language model to score or critique another model’s outputs on dimensions like correctness, style, safety, or schema adherence. It enables scalable quality control, often combined with human spot-checks.
LoRA (Low-Rank Adaptation)
A PEFT method that injects small low-rank matrices into a model’s layers and trains only those. It achieves strong task adaptation with minimal memory and compute while leaving base weights frozen.
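A minimal PyTorch sketch of the idea: the frozen weight W is augmented with a trainable low-rank product BA, so only r·(d_in + d_out) parameters are learned per layer (the defaults for r and alpha here are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze the pre-trained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```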
Model Collapse
Quality and diversity degrade when new models are trained primarily on outputs of prior models rather than real data. Over time, rare patterns vanish and artifacts amplify; mitigation includes mixing real data, de-duplication, and quality filters.
Multi-Modal Models
Models that understand and generate across multiple data types (e.g., text↔image, text↔tables, audio+video). They can synthesize richer datasets by respecting cross-modal alignments (e.g., captions matching images or vitals aligned with clinical notes).
Parameter-Efficient Fine-Tuning (PEFT)
A family of methods (e.g., LoRA, adapters, prefix-tuning) that update a small fraction of parameters to adapt a large model. PEFT reduces cost and makes it practical to maintain multiple task variants without retraining the full model.
Privacy-Preserving
Designs and techniques that reduce the risk of exposing individuals while keeping data useful (e.g., differential privacy, k-anonymity families, secure enclaves, federated learning, careful sampling). Synthetic data are not inherently private; privacy depends on the training process, safeguards, and evaluation.
Retrieval-Augmented Generation (RAG)
Combining generation with live retrieval from knowledge sources. The model pulls relevant passages (search, vector DB) and conditions on them, improving factual accuracy, freshness, and citation in synthetic text generation.
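A toy sketch of the retrieval half (embeddings are plain numpy vectors here; real systems use an embedding model and a vector database):

```python
import numpy as np

def top_k_passages(query_emb, passage_embs, passages, k=3):
    """Return the k passages whose embeddings are most cosine-similar to the query."""
    sims = (passage_embs @ query_emb) / (
        np.linalg.norm(passage_embs, axis=1) * np.linalg.norm(query_emb))
    best = np.argsort(sims)[::-1][:k]
    return [passages[i] for i in best]
```

The retrieved passages are then placed in the prompt so generation is grounded in them.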
Red-Teaming
Structured adversarial testing to uncover failure modes—security, safety, privacy, and quality. For synthesis systems, red-teaming targets leakage, bias amplification, constraint violations, and invalid schema generation.
Statistical Properties
Quantities that summarize data behavior—moments (mean, variance, skew), dependence (correlations, mutual information), and shape (tails, multimodality). Preserving the right properties is key to downstream validity and trustworthy analytics.
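A minimal sketch comparing a few such properties for one column of real vs. synthetic data:

```python
import numpy as np
from scipy import stats

def moment_gaps(real: np.ndarray, synthetic: np.ndarray) -> dict:
    return {
        "mean_gap": abs(real.mean() - synthetic.mean()),
        "variance_gap": abs(real.var() - synthetic.var()),
        "skew_gap": abs(stats.skew(real) - stats.skew(synthetic)),
    }
```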
Synthesizer
An algorithm—often a trained model—in a synthetic-data pipeline that learns from real data to generate new records that mirror key statistical properties and structural relationships. In this primer, synthesizers fall into three families: Classical Modelling (statistical + traditional ML, e.g., parametric distributions, Bayesian networks, copulas, decision-tree/clustering models), Generative Models (deep learning, e.g., GANs, VAEs, diffusion, autoregressive transformers, normalizing flows), and Foundation-Model–based Generation (large pre-trained models—LLMs, ViTs, multimodal FMs—adapted via prompting or fine-tuning).
Semantic Relationships
Meaningful links that must hold across fields or entities (e.g., diagnosis ↔ medication, transaction date ≥ account open date). High-quality synthesis respects such constraints so records remain coherent and useful.
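A minimal validation sketch (the column names are illustrative) flagging synthetic rows that break a cross-field constraint:

```python
import pandas as pd

def constraint_violations(df: pd.DataFrame) -> pd.Series:
    """True where a transaction predates the account it belongs to."""
    return df["transaction_date"] < df["account_open_date"]

# Example: drop or regenerate rows where constraint_violations(df) is True.
```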
Temporal Dynamics
Patterns that evolve over time (seasonality, trends, lagged effects). Time-aware synthesizers model sequences and intervals so forecasts, interventions, and longitudinal analyses behave realistically.
Training Data
The data used to fit a model’s parameters. In synthesis, training data define the target distribution; careful curation (deduping, de-biasing, contamination checks) shapes both realism and privacy risk.
Transformers
Neural networks built around self-attention, which learns which tokens (or patches) should influence each other. They scale well and power many state-of-the-art generators and LLMs used for synthetic text, images, and structured data.
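At the core is scaled dot-product attention, softmax(QKᵀ/√d)·V; a minimal numpy sketch:

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # token-to-token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # weighted mix of value vectors
```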
Zero-Shot Learning
A model’s ability to perform a task it wasn’t explicitly trained for. Broad pretraining plus clear instructions (prompts) lets the model generalize to unseen labels, formats, or schemas.
Zero-Shot Prompting
Issuing task instructions to a model without providing examples. The model relies on its general knowledge to comply; structure, constraints, or retrieval can be added to improve accuracy and reduce hallucinations.
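A minimal example (the task, schema, and fields are illustrative): instructions and constraints only, no worked examples.

```python
prompt = (
    "Generate one synthetic customer record as JSON with fields "
    "age (integer 18-90), plan ('basic' or 'premium'), and signup_date "
    "(ISO 8601 date). Return only valid JSON, no extra text."
)
```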