References

This page lists all references cited throughout the Synthetic Data Guide, along with additional resources for readers who want to explore the topic further.

1. Research Papers

2. Articles, Reports and News

Industry Resources

Synthetic Data: Navigating Its Methodologies, Applications and Challenges. GovTech Singapore Medium.
Complete Guide to Synthetic Data Generation for AI Models. Averroes AI.
Guide: Everything You Need to Know About Synthetic Data. Syntheticus AI.
What is Synthetic Data? IBM Research.
Synthetic Data Generation (SDG). NVIDIA.
Synthetic Data in 2024: Progress, Opportunities, Challenges. Tim Lrx Blog.
The Promise and Perils of Synthetic Data. TechCrunch.
Synthetic Data Generation with LLMs. Towards Data Science.
LLMflation: LLM Inference Cost Trends. Andreessen Horowitz.
LLM Scaling Laws. Cameron R. Wolfe Substack.
Small Models, Big Wins: Four Reasons Enterprises are Choosing SLMs over LLMs. TechRadar Pro.
Why Small Language Models are the Next Big Thing in AI. VentureBeat.
Tech Companies are Turning to ‘Synthetic Data’ to Train AI Models – but There is a Hidden Cost. The Straits Times.
Gartner Peer Community Insights: Generative AI for Synthetic Data. Gartner.
Best Synthetic Data Generation Tools for 2025. K2View.
Streamline & Accelerate AI Initiatives with Synthetic Data. IBM Think.
Afro-TTS: African English Text-to-Speech. Hugging Face.
Global Victim-Perpetrator Synthetic Dataset. Counter-Trafficking Data Collaborative.
Helm.ai: Synthetic Data for Autonomous Vehicle Training. Company Website.
Alpaca: Instruction-Following Language Model. Stanford CRFM.
How to Improve RAG Model Performance with Synthetic Data. Gretel.ai.
Tips to Improve Synthetic Data Accuracy. Gretel Documentation.
Data Simulation. Mostly AI.
How to Evaluate the Quality of Synthetic Data. AWS Machine Learning Blog.
The Fundamental Trilemma of Synthetic Data Generation. TMLT.
Understanding Missing Data Mechanisms. YData.
Cosmopedia: Synthetic Textbook Generation. Hugging Face.
DP-Auditorium: Differential Privacy Auditing Library. Google Research.
Protecting Users with Differentially Private Synthetic Training Data. Google Research.
Generating Synthetic Data with Differentially Private LLM Inference. Google Research.
Synthetic and Federated: Privacy-preserving Domain Adaptation with LLMs for Mobile Applications. Google Research.
SynthID: Watermarking for AI-Generated Content. Google DeepMind.
SYNTHLLM: Breaking the AI Data Wall with Scalable Synthetic Data. Microsoft Research.
MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets. Amazon Science.

Government and Regulatory Resources

AI Risk Management Framework. NIST.
Information Paper on AI Risk Management. Monetary Authority of Singapore.
Proposed Guide on Synthetic Data Generation. Personal Data Protection Commission Singapore.
FCA Feedback Statement on Synthetic Data. Financial Conduct Authority UK.
Artificial Data for Healthcare Applications. NHS Digital.
Simulacrum - Synthetic Cancer Dataset. Health Data Insight.
National Registry of Live Births - Synthetic Dataset. Israel Ministry of Health.
Survey of Income and Program Participation Synthetic Beta Data Product. US Census Bureau.
Practical Guide to Differential Privacy for Humanitarian Data. UNHCR.

3. Books

Practical Synthetic Data Generation by Khaled El Emam, Lucy Mosquera, and Richard Hoptroff. O’Reilly Media.
AI Engineering by Chip Huyen. O’Reilly Media.
Practical Data Privacy by Katharine Jarmul. O’Reilly Media.
Hands-on Differential Privacy by Ethan Cowan, Michael Shoemate, and Mayana Pereira. O’Reilly Media.

4. Others

Open Source Tools and Software