Overview

Figure: Conceptual representation — like robots manufacturing similar robots, synthetic data generation creates artificial data that mirrors the patterns of real data. (Image adapted from "Best practices and lessons learned on synthetic data for language models", Google DeepMind, 2024.)
Synthetic data—artificially generated data that reflects real-world patterns—is gaining momentum as a way to ease data bottlenecks and support more ethical, responsible AI. Major AI labs (Anthropic, Meta, Nvidia, Microsoft, OpenAI) are increasingly supplementing real-world corpora with synthetic data. Governments in Singapore and worldwide are also exploring it as a practical response to data quality and privacy challenges.
To make this field more accessible to practitioners, researchers, decision-makers, and potential adopters, we have created this primer. For guidance on where to start, refer to our reading guide, which suggests chapters based on your interests and needs.
After reading, you’ll walk away with these key insights:
- Synthetic data goes far beyond basic rule-based methods—it spans diverse modalities and leverages everything from classical statistics to advanced foundation models, and choosing the right synthesizer for your use case requires a sound practical approach. (What is Synthetic Data, Generation Methods)
- Every use case comes with caveats—synthetic data enables powerful applications, but each has limitations and trade-offs to understand before implementation. (Why Use Synthetic Data)
- Evaluation requires multiple dimensions and reasonable metrics, along with an understanding of how they conflict in context—choosing the right quality dimensions and metrics is critical to ensuring synthetic data serves its intended purpose. (Quality Evaluation)
- Synthetic data is not private by default and requires context-driven risk evaluation—robust privacy requires threat modeling and, where needed, formal guarantees such as differential privacy. (Privacy-preserving Synthesis)
- Success requires a systematic synthesis approach—it’s not about finding the “right” AI model, but about systematically aligning the pipeline with your specific requirements. (Practical Data Synthesis)
- LLM-driven approaches are reshaping the field but require well-thought-out generation strategies—prompting and fine-tuning techniques are advancing rapidly, yet challenges and risks persist. (LLM-driven Data Synthesis)
- Synthetic data already has real-world impact across diverse sectors and use cases—with applications spanning from healthcare to policy-making. (Real World Deployments, Applications in Public Sector)
- Responsible development is essential, treating challenges as design constraints rather than deal-breakers—like any powerful technology, synthetic data demands attention to risks and emerging challenges. (Challenges and Risks)
- Staying current with recent developments is essential—to track improving capabilities, quality, risk mitigation, governance frameworks, and interpretability: possibility → engineering → governance → insight. (Outlook & Trends)
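The first insight above notes that synthesizers range from classical statistics all the way to foundation models. As a minimal, hypothetical sketch of the classical end of that spectrum, the snippet below fits a multivariate Gaussian to a toy "real" dataset and samples fresh synthetic rows that mirror its means and correlations. All column names and numbers are illustrative, not drawn from any actual source.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for a "real" dataset: two numeric columns (age-like, income-like).
# The values are purely illustrative.
real = np.column_stack([
    rng.normal(40, 10, 1000),
    rng.lognormal(10, 0.5, 1000),
])

# Classical-statistics synthesizer: fit a multivariate Gaussian to the real
# data, then sample fresh records from it. The synthetic rows mirror the
# means and correlations of the original without copying any single record.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=1000)

print(synthetic.shape)                                      # (1000, 2)
print(np.allclose(synthetic.mean(axis=0), mean, rtol=0.1))  # column means match closely
```

Even this tiny example illustrates two of the insights above: the synthesizer captures aggregate patterns rather than individual records, yet, as the privacy bullet warns, such output is not automatically private—that requires separate threat modeling and, where needed, formal guarantees.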
This primer is a living document. As the field evolves, we will continue to update it to reflect the latest developments. We hope it proves useful for your work, and we welcome your feedback as we refine it further.
About Us
This primer is an initiative of the Government Technology Agency of Singapore (GovTech Singapore)’s Data Practice. Our work spans key domains such as data engineering, privacy-enhancing technologies, data platforms, governance, and quality. We deliver outputs that include advanced prototypes and agency pilots, as well as shared standards, playbooks, and training. Through co-creation with agencies, IHLs, and industry, we translate promising ideas into production-ready solutions aligned with government policy.
Specifically for data privacy, our contributions include experimental projects, research translated into practice, publications, presentations on international platforms, and papers at top-tier academic venues.
Learn more about us here.
Acknowledgments
We extend our sincere gratitude to the expert reviewers (see Credits) who provided valuable feedback and insights on this primer. Their expertise and thoughtful reviews have significantly contributed to the quality and accuracy of this work.
Footnotes
1. AI Researcher Interview with Andrej Karpathy
2. TechCrunch: Elon Musk agrees that we’ve exhausted AI training data
3. OfficeChai: When Human Data Is Too Expensive, Synthetic Data Could Be Used To Train AI Models: Cohere CEO Aidan Gomez
4. AsiaOne: ‘Safety labels’ in AI apps to clearly state risks, testing in discussion: Josephine Teo
5. ClearBox AI: The role of synthetic data within the EU AI Act
6. Fortune Business Insights: Synthetic Data Generation Market