Overview

Figure: Conceptual representation — like robots manufacturing similar robots, synthetic data generation creates artificial data that mirrors the patterns of real data. (Image adapted from "Best practices and lessons learned on synthetic data for language models", Google DeepMind, 2024.)
Synthetic data—artificially generated data that reflects real-world patterns—is gaining momentum as a way to ease data bottlenecks and support more ethical, responsible AI. Major AI labs (Anthropic, Meta, Nvidia, Microsoft, OpenAI) are increasingly supplementing real-world corpora with synthetic data. Governments in Singapore and worldwide are also exploring it as a practical response to data quality and privacy challenges.
To make this field more accessible to practitioners, researchers, decision-makers, and potential adopters, we have created this primer. For guidance on where to start, refer to our reading guide, which suggests chapters based on your interests and needs.
After reading, you’ll walk away with these key insights:
- Synthetic data goes far beyond basic rule-based methods—it spans diverse modalities and leverages everything from classical statistics to advanced foundation models, and choosing the right synthesizer for your use case requires a sound practical approach. (What is Synthetic Data, Generation Methods)
- Every use case comes with caveats—synthetic data enables powerful applications, but each has limitations and trade-offs to understand before implementation. (Why Use Synthetic Data)
- Evaluation requires multiple dimensions and reasonable metrics, along with an understanding of how they conflict in context—choosing the right quality dimensions and metrics is critical to ensuring synthetic data serves its intended purpose. (Quality Evaluation)
- Synthetic data is not private by default and requires context-driven risk evaluation—robust privacy requires threat modeling and, where needed, formal guarantees such as differential privacy. (Privacy-preserving Synthesis)
- Success requires a systematic synthesis approach—it’s not about finding the “right” AI model, but about systematically aligning the pipeline with your specific requirements. (Practical Data Synthesis)
- LLM-driven approaches are reshaping the field but require well-thought-out generation strategies—prompting and fine-tuning techniques are advancing rapidly, yet challenges and risks persist. (LLM-driven Data Synthesis)
- Synthetic data already has real-world impact across diverse sectors and use cases—with applications spanning from healthcare to policy-making. (Real World Deployments, Applications in Public Sector)
- Responsible development is essential, treating challenges as design constraints rather than deal-breakers—like any powerful technology, synthetic data demands attention to risks and emerging challenges. (Challenges and Risks)
- Staying current with recent developments is essential—to track improving capabilities, quality, risk mitigation, governance frameworks, and interpretability: possibility → engineering → governance → insight. (Outlook & Trends)
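The first insight above notes that synthesizers range from classical statistics all the way to foundation models. As a minimal, hypothetical sketch of the classical end of that spectrum, the snippet below fits a multivariate Gaussian to a toy "real" dataset and samples fresh synthetic rows that mirror its means and correlations. All column names and numbers are illustrative, not drawn from any actual source.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for a "real" dataset: two numeric columns (age-like, income-like).
# The values are purely illustrative.
real = np.column_stack([
    rng.normal(40, 10, 1000),
    rng.lognormal(10, 0.5, 1000),
])

# Classical-statistics synthesizer: fit a multivariate Gaussian to the real
# data, then sample fresh records from it. The synthetic rows mirror the
# means and correlations of the original without copying any single record.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=1000)

print(synthetic.shape)                                      # (1000, 2)
print(np.allclose(synthetic.mean(axis=0), mean, rtol=0.1))  # column means match closely
```

Even this tiny example illustrates two of the insights above: the synthesizer captures aggregate patterns rather than individual records, yet, as the privacy bullet warns, such output is not automatically private—that requires separate threat modeling and, where needed, formal guarantees.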
This primer is a living document. As the field evolves, we will continue to update it to reflect the latest developments. We hope it proves useful for your work, and we welcome your feedback as we refine it further.
About Us
This primer is an initiative of the Government Technology Agency of Singapore (GovTech Singapore)’s Data Practice. Our work spans key domains such as data engineering, privacy-enhancing technologies, data platforms, governance, and quality. We deliver outputs that include advanced prototypes and agency pilots, as well as shared standards, playbooks, and training. Through co-creation with agencies, IHLs, and industry, we translate promising ideas into production-ready solutions aligned with government policy.
Specifically for data privacy, our contributions include experimental projects, research translated into practice, publications, presentations on international platforms, and papers at top-tier academic venues.
Learn more about us here.
Acknowledgments
We extend our sincere gratitude to the expert reviewers (see Credits) who provided valuable feedback and insights on this primer. Their expertise and thoughtful reviews have significantly contributed to the quality and accuracy of this work.
Footnotes
1. AI Researcher Interview with Andrej Karpathy
2. TechCrunch: Elon Musk agrees that we’ve exhausted AI training data
3. OfficeChai: When Human Data Is Too Expensive, Synthetic Data Could Be Used To Train AI Models: Cohere CEO Aidan Gomez
4. AsiaOne: ‘Safety labels’ in AI apps to clearly state risks, testing in discussion: Josephine Teo
5. ClearBox AI: The role of synthetic data within the EU AI Act
6. Fortune Business Insights: Synthetic Data Generation Market