References
This page lists every reference cited in the Synthetic Data Primer, along with additional resources for readers who want to explore the topic further.
1. Research Papers
- Comprehensive Exploration of Synthetic Data Generation: A Survey
- A Systematic Review of Synthetic Data Generation Techniques Using Generative AI
- Synthetic Data in AI: Challenges, Applications, and Ethical Implications
- Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data
- A Hyperparameter Tuning Framework for Tabular Synthetic Data Generation Methods
- Navigating Tabular Data Synthesis Research: Understanding User Needs and Tool Capabilities
- What’s Wrong with Your Synthetic Tabular Data? Using Explainable AI to Evaluate Generative Models
- How Well Does Your Tabular Generator Learn the Structure of Tabular Data?
- On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey
- Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding - A Survey
- Are LLMs Naturally Good at Synthetic Tabular Data Generation?
- Harnessing Large-Language Models to Generate Private Synthetic Text
- Synthetic Data Generation Using Large Language Models: Advances in Text and Code
- LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation
- A Survey of LLM-Based Methods for Synthetic Data Generation and the Rise of Agentic Workflows
- Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World
- AI Models Collapse When Trained on Recursively Generated Data
- A Note on Shumailov et al. (2024): ‘AI Models Collapse When Trained on Recursively Generated Data’
- The Curse of Recursion: Training on Generated Data Makes Models Forget
- Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data
- Differentially Private Release of Israel’s National Registry of Live Births
- On the Challenges of Deploying Privacy-Preserving Synthetic Data in the Enterprise
- Differentially Private Synthetic Data: Applied Evaluations and Enhancements
- Differentially Private Federated Learning of Diffusion Models for Synthetic Tabular Data Generation
- Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe
- On Renyi Differential Privacy in Statistics-Based Synthetic Data Generation
- Comparative Study of Differentially Private Synthetic Data Algorithms from the NIST PSDP Challenge
- Understanding the Impact of Data Domain Extraction on Synthetic Data Privacy
- Comment on “NIST SP 800-226: Guidelines for Evaluating Differential Privacy Guarantees”
- Advancing Differential Privacy: Where We Are Now and Future Directions for Real-World Deployment
- Anonymeter: A Unified Framework for Quantifying Privacy Risk in Synthetic Data
- A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage
- SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains
- Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs
- Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework
- An Evaluation of the Replicability of Analyses Using Synthetic Health Data
- A Scoping Review of Privacy and Utility Metrics in Medical Synthetic Data
- The DCR Delusion: Measuring the Privacy Risk of Synthetic Data
- DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks
- Debiasing Synthetic Data Generated by Deep Generative Models
- Synthetic Data Generation Methods in Healthcare: A Review on Open-Source Tools and Methods
- Large Language Models and Synthetic Health Data: Progress and Prospects
- Ensemble Learning for Large Language Models in Text and Code Generation: A Survey
- Better Synthetic Data by Retrieving and Transforming Existing Datasets
- Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
- AgentInstruct: Toward Generative Teaching with Agentic Flows
- Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation
- The Synthetic Mirror – Synthetic Data at the Age of Agentic AI
- On the Diversity of Synthetic Data and Its Impact on Training Large Language Models
- Domain Adaptation: Challenges, Methods, Datasets, and Applications
- Synthetic Data Can Benefit Medical Research — but Risks Must be Recognized
2. Articles, Reports and News
Industry Resources
- Synthetic Data: Navigating Its Methodologies, Applications and Challenges. GovTech Singapore Medium.
- Complete Guide to Synthetic Data Generation for AI Models. Averroes AI.
- Guide: Everything You Need to Know About Synthetic Data. Syntheticus AI.
- What is Synthetic Data? IBM Research.
- Synthetic Data Generation (SDG). NVIDIA.
- Synthetic Data in 2024: Progress, Opportunities, Challenges. Tim Lrx Blog.
- The Promise and Perils of Synthetic Data. TechCrunch.
- Synthetic Data Generation with LLMs. Towards Data Science.
- LLMflation: LLM Inference Cost Trends. Andreessen Horowitz.
- LLM Scaling Laws. Cameron R. Wolfe Substack.
- Small Models, Big Wins: Four Reasons Enterprises are Choosing SLMs over LLMs. TechRadar Pro.
- Why Small Language Models are the Next Big Thing in AI. VentureBeat.
- Tech Companies are Turning to ‘Synthetic Data’ to Train AI Models – but There is a Hidden Cost. The Straits Times.
- Gartner Peer Community Insights: Generative AI for Synthetic Data. Gartner.
- Streamline & Accelerate AI Initiatives with Synthetic Data. IBM Think.
- Afro-TTS: African English Text-to-Speech. Hugging Face.
- Global Victim-Perpetrator Synthetic Dataset. Counter-Trafficking Data Collaborative.
- Helm.ai: Synthetic Data for Autonomous Vehicle Training. Company Website.
- Alpaca: Instruction-Following Language Model. Stanford CRFM.
- How to Improve RAG Model Performance with Synthetic Data. Gretel.ai.
- Tips to Improve Synthetic Data Accuracy. Gretel Documentation.
- Data Simulation. Mostly AI.
- How to Evaluate the Quality of Synthetic Data. AWS Machine Learning Blog.
- The Fundamental Trilemma of Synthetic Data Generation. TMLT.
- Cosmopedia: Synthetic Textbook Generation. Hugging Face.
- DP-Auditorium: Differential Privacy Auditing Library. Google Research.
- Protecting Users with Differentially Private Synthetic Training Data. Google Research.
- Generating Synthetic Data with Differentially Private LLM Inference. Google Research.
- Synthetic and Federated: Privacy-preserving Domain Adaptation with LLMs for Mobile Applications. Google Research.
- SynthID: Watermarking for AI-Generated Content. Google DeepMind.
- SYNTHLLM: Breaking the AI Data Wall with Scalable Synthetic Data. Microsoft Research.
- MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets. Amazon Science.
Government and Regulatory Resources
- AI Risk Management Framework. NIST.
- Information Paper on AI Risk Management. Monetary Authority of Singapore.
- Proposed Guide on Synthetic Data Generation. Personal Data Protection Commission Singapore.
- FCA Feedback Statement on Synthetic Data. Financial Conduct Authority UK.
- Artificial Data for Healthcare Applications. NHS Digital.
- Simulacrum - Synthetic Cancer Dataset. Health Data Insight.
- National Registry of Live Births - Synthetic Dataset. Israel Ministry of Health.
- Survey of Income and Program Participation Synthetic Beta Data Product. US Census Bureau.
- Practical Guide to Differential Privacy for Humanitarian Data. UNHCR.
3. Books
- Practical Synthetic Data Generation by Khaled El Emam, Lucy Mosquera, Richard Hoptroff. O’Reilly Media.
- AI Engineering by Chip Huyen. O’Reilly Media.
- Practical Data Privacy by Katharine Jarmul. O’Reilly Media.
- Hands-on Differential Privacy by Ethan Cowan, Michael Shoemate, Mayana Pereira. O’Reilly Media.