References

This page lists every reference cited in the Synthetic Data Primer, along with additional resources for readers who want to explore the topic further.

  1. Synthetic Data — what, why and how?

  2. A Comprehensive Survey of Synthetic Tabular Data Generation

  3. Comprehensive Exploration of Synthetic Data Generation: A Survey

  4. A Systematic Review of Synthetic Data Generation Techniques Using Generative AI

  5. Synthetic Data in AI: Challenges, Applications, and Ethical Implications

  6. Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data

  7. Best Practices and Lessons Learned on Synthetic Data

  8. Modeling tabular data using conditional GAN

  9. A Hyperparameter Tuning Framework for Tabular Synthetic Data Generation Methods

  10. Navigating Tabular Data Synthesis Research: Understanding User Needs and Tool Capabilities

  11. TabularARGN: A Flexible and Efficient Auto-Regressive Framework for Generating High-Fidelity Synthetic Data

  12. What’s Wrong with Your Synthetic Tabular Data? Using Explainable AI to Evaluate Generative Models

  13. How Well Does Your Tabular Generator Learn the Structure of Tabular Data?

  14. On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

  15. Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding - A Survey

  16. Are LLMs Naturally Good at Synthetic Tabular Data Generation?

  17. Harnessing Large-Language Models to Generate Private Synthetic Text

  18. Synthetic Data Generation Using Large Language Models: Advances in Text and Code

  19. LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation

  20. Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations

  21. A Survey of LLM-Based Methods for Synthetic Data Generation and the Rise of Agentic Workflows

  22. Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World

  23. Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

  24. AI Models Collapse When Trained on Recursively Generated Data

  25. A Note on Shumailov et al. (2024): ‘AI Models Collapse When Trained on Recursively Generated Data’

  26. The Curse of Recursion: Training on Generated Data Makes Models Forget

  27. Nepotistically Trained Generative Image Models Collapse

  28. Scaling Laws of Synthetic Data for Language Models

  29. Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data

  30. The Algorithmic Foundations of Differential Privacy

  31. Differentially Private Release of Israel’s National Registry of Live Births

  32. On the Challenges of Deploying Privacy-Preserving Synthetic Data in the Enterprise

  33. Differentially Private Synthetic Data: Applied Evaluations and Enhancements

  34. Differentially Private Federated Learning of Diffusion Models for Synthetic Tabular Data Generation

  35. Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe

  36. On Renyi Differential Privacy in Statistics-Based Synthetic Data Generation

  37. PrivBayes: Private Data Release via Bayesian Networks

  38. Comparative Study of Differentially Private Synthetic Data Algorithms from the NIST PSDP Challenge

  39. Understanding the Impact of Data Domain Extraction on Synthetic Data Privacy

  40. Comment on “NIST SP 800-226: Guidelines for Evaluating Differential Privacy Guarantees”

  41. Advancing Differential Privacy: Where We Are Now and Future Directions for Real-World Deployment

  42. How Faithful is Your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models

  43. The Inadequacy of Similarity-based Privacy Metrics: Privacy Attacks against ‘Truly Anonymous’ Synthetic Datasets

  44. Anonymeter: A Unified Framework for Quantifying Privacy Risk in Synthetic Data

  45. Synthetic Data–Anonymisation Groundhog Day

  46. What Has Been Lost with Synthetic Evaluation?

  47. A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage

  48. SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains

  49. Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs

  50. Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework

  51. An Evaluation of the Replicability of Analyses Using Synthetic Health Data

  52. A Scoping Review of Privacy and Utility Metrics in Medical Synthetic Data

  53. The DCR Delusion: Measuring the Privacy Risk of Synthetic Data

  54. DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks

  55. Can Synthetic Data be Fair and Private? A Comparative Study of Synthetic Data Generation and Fairness Algorithms

  56. Debiasing Synthetic Data Generated by Deep Generative Models

  57. Synthetic Data Generation Methods in Healthcare: A Review on Open-Source Tools and Methods

  58. Large Language Models Generating Synthetic Clinical Datasets: A Feasibility and Comparative Analysis with Real-World Perioperative Data

  59. Large Language Models and Synthetic Health Data: Progress and Prospects

  60. Empowering Time Series Analysis with Synthetic Data: A Survey and Outlook in the Era of Foundation Models

  61. Case2Code: Scalable Synthetic Data for Code Generation

  62. Ensemble Learning for Large Language Models in Text and Code Generation: A Survey

  63. Better Synthetic Data by Retrieving and Transforming Existing Datasets

  64. Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

  65. Towards Internet-Scale Training for Agents

  66. AgentInstruct: Toward Generative Teaching with Agentic Flows

  67. Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation

  68. The Synthetic Mirror – Synthetic Data at the Age of Agentic AI

  69. Towards a Theoretical Understanding of Synthetic Data in LLM Post-Training: A Reverse-Bottleneck Perspective

  70. On the Diversity of Synthetic Data and Its Impact on Training Large Language Models

  71. On the Legal Nature of Synthetic Data

  72. Synthetic Data RL: Task Definition Is All You Need

  73. Fine-tuning Large Language Models for Domain Adaptation: Exploration of Training Strategies, Scaling, Model Merging and Synergistic Capabilities

  74. Domain Adaptation: Challenges, Methods, Datasets, and Applications

  75. Synthetic Data Can Benefit Medical Research — but Risks Must be Recognized

  1. Synthetic Data: Navigating Its Methodologies, Applications and Challenges. GovTech Singapore Medium.

  2. Complete Guide to Synthetic Data Generation for AI Models. Averroes AI.

  3. Guide: Everything You Need to Know About Synthetic Data. Syntheticus AI.

  4. What is Synthetic Data? IBM Research.

  5. Synthetic Data Generation (SDG). NVIDIA.

  6. Synthetic Data in 2024: Progress, Opportunities, Challenges. Tim Lrx Blog.

  7. The Promise and Perils of Synthetic Data. TechCrunch.

  8. Synthetic Data Generation with LLMs. Towards Data Science.

  9. LLMflation: LLM Inference Cost Trends. Andreessen Horowitz.

  10. LLM Scaling Laws. Cameron R. Wolfe Substack.

  11. Small Models, Big Wins: Four Reasons Enterprises are Choosing SLMs over LLMs. TechRadar Pro.

  12. Why Small Language Models are the Next Big Thing in AI. VentureBeat.

  13. Tech Companies are Turning to ‘Synthetic Data’ to Train AI Models – but There is a Hidden Cost. The Straits Times.

  14. Gartner Peer Community Insights: Generative AI for Synthetic Data. Gartner.

  15. Best Synthetic Data Generation Tools for 2025. K2View.

  16. Streamline & Accelerate AI Initiatives with Synthetic Data. IBM Think.

  17. Afro-TTS: African English Text-to-Speech. Hugging Face.

  18. Global Victim-Perpetrator Synthetic Dataset. Counter-Trafficking Data Collaborative.

  19. Helm.ai: Synthetic Data for Autonomous Vehicle Training. Company Website.

  20. Alpaca: Instruction-Following Language Model. Stanford CRFM.

  21. How to Improve RAG Model Performance with Synthetic Data. Gretel.ai.

  22. Tips to Improve Synthetic Data Accuracy. Gretel Documentation.

  23. Data Simulation. Mostly AI.

  24. How to Evaluate the Quality of Synthetic Data. AWS Machine Learning Blog.

  25. The Fundamental Trilemma of Synthetic Data Generation. TMLT.

  26. Understanding Missing Data Mechanisms. YData.

  27. Cosmopedia: Synthetic Textbook Generation. Hugging Face.

  28. DP-Auditorium: Differential Privacy Auditing Library. Google Research.

  29. Protecting Users with Differentially Private Synthetic Training Data. Google Research.

  30. Generating Synthetic Data with Differentially Private LLM Inference. Google Research.

  31. Synthetic and Federated: Privacy-preserving Domain Adaptation with LLMs for Mobile Applications. Google Research.

  32. SynthID: Watermarking for AI-Generated Content. Google DeepMind.

  33. SYNTHLLM: Breaking the AI Data Wall with Scalable Synthetic Data. Microsoft Research.

  34. MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets. Amazon Science.

  1. AI Risk Management Framework. NIST.

  2. Information Paper on AI Risk Management. Monetary Authority of Singapore.

  3. Proposed Guide on Synthetic Data Generation. Personal Data Protection Commission Singapore.

  4. FCA Feedback Statement on Synthetic Data. Financial Conduct Authority UK.

  5. Artificial Data for Healthcare Applications. NHS Digital.

  6. Simulacrum - Synthetic Cancer Dataset. Health Data Insight.

  7. National Registry of Live Births - Synthetic Dataset. Israel Ministry of Health.

  8. Survey of Income and Program Participation Synthetic Beta Data Product. US Census Bureau.

  9. Practical Guide to Differential Privacy for Humanitarian Data. UNHCR.

  1. Practical Synthetic Data Generation by Khaled El Emam, Lucy Mosquera, Richard Hoptroff. O’Reilly Media.

  2. AI Engineering by Chip Huyen. O’Reilly Media.

  3. Practical Data Privacy by Katharine Jarmul. O’Reilly Media.

  4. Hands-on Differential Privacy by Ethan Cowan, Michael Shoemate, Mayana Pereira. O’Reilly Media.

  1. Anonymeter: Privacy Risk Evaluation

  2. Differential Privacy Library

  3. OpenDP: Differential Privacy Platform

  4. SmartNoise SDK

  5. SDMetrics: Synthetic Data Evaluation

  6. DataDreamer: Structured Data Generation Workflows

  7. Dria SDK: Distributed Synthetic Data Generation

  8. SDV: Synthetic Data Generation

  9. SynthCity: Synthetic Data Generation Library

  10. SynthFlow

  11. YData Synthetic

  12. HyperImpute: Smart Imputation Methods

  13. MostlyAI Quality Assurance

  14. RDT: Reversible Data Transforms

  15. YData Profiling

  16. TSGBench: Time Series Generation Benchmark

  17. REaLTabFormer: Multi-table Generation