Reading Guide

This primer provides a comprehensive introduction to synthetic data, covering everything from basic concepts to advanced applications. Each chapter is designed to be as standalone as possible, so you can jump to topics that interest you most. This guide helps you navigate the content based on your background and interests.

| Reader Type | Primary Focus | Recommended Reading Path |
|---|---|---|
| Beginners | Understanding fundamentals | What is Synthetic Data, Why Use Synthetic Data, Generation Methods, Quality Evaluation, Real World Deployments |
| Practitioners | Implementation & development | Why Use Synthetic Data, Generation Methods, Quality Evaluation, Privacy-preserving Synthesis, Practical Data Synthesis, LLM-driven Data Synthesis, Real World Deployments, Challenges and Risks |
| Decision Makers | Business value & strategy | Why Use Synthetic Data, Real World Deployments, Applications in Public Sector, Challenges and Risks, Outlook & Trends |
| Privacy Officers | Compliance & risk management | Privacy-preserving Synthesis, Real World Deployments, Applications in Public Sector, Challenges and Risks |
| Researchers | Latest developments & methods | LLM-driven Data Synthesis, Challenges and Risks, Outlook & Trends |

These fundamental concepts appear throughout the primer. For detailed definitions, see our Glossary of Terms.

  • Synthetic Data - Artificial data generated by algorithms that imitates the patterns of real data, such as survey responses, financial transactions, or sensor readings collected from actual people, events, or systems
  • Synthesizer - The AI system that creates synthetic data by learning from real examples
  • Generative Models - AI models that can create new content (like text, images, or tabular data) after learning from examples
  • Training Dataset - The real data used to teach the AI what patterns to copy
  • Held-out Dataset - Real data kept separate from the training dataset and used to test how well the synthetic data works
  • Fine-tuning - Customizing an AI model for a specific task or industry
  • Prompt Engineering - Writing instructions to get AI to produce what you want
  • Data Modality - The format of data (spreadsheets, photos, text, audio recordings, etc.)
  • Privacy Preservation - Managing the risk of exposing people’s identifiable or sensitive information
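
To make the relationship between the training dataset, the synthesizer, and the held-out dataset concrete, here is a minimal sketch in Python. It is illustrative only and not taken from the primer: the `split_real_data` helper, the toy `GaussianSynthesizer` class, and the simulated "real" values are all assumptions made for this example, and practical synthesizers are far more capable than fitting a single mean and standard deviation.

```python
import random
import statistics


def split_real_data(values, held_out_fraction=0.25, seed=0):
    """Shuffle real data and split it into training and held-out parts."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - held_out_fraction))
    return shuffled[:cut], shuffled[cut:]


class GaussianSynthesizer:
    """Toy synthesizer (hypothetical): learns a mean and spread, then samples."""

    def fit(self, training):
        # "Learning from real examples" here is just two summary statistics.
        self.mean = statistics.fmean(training)
        self.stdev = statistics.stdev(training)
        return self

    def sample(self, n, seed=0):
        # Draw new artificial values that imitate the learned pattern.
        rng = random.Random(seed)
        return [rng.gauss(self.mean, self.stdev) for _ in range(n)]


# Stand-in for real data (e.g. sensor readings); simulated for this sketch.
data_rng = random.Random(42)
real = [data_rng.gauss(50, 10) for _ in range(1000)]

# The synthesizer only ever sees the training dataset.
training, held_out = split_real_data(real)
synthetic = GaussianSynthesizer().fit(training).sample(len(held_out))

# The held-out dataset acts as an unseen reference for a simple quality check.
mean_gap = abs(statistics.fmean(synthetic) - statistics.fmean(held_out))
print(f"training={len(training)} held_out={len(held_out)} mean_gap={mean_gap:.2f}")
```

A small gap between the synthetic and held-out means suggests the synthesizer captured at least this one pattern of the real data. A realistic evaluation would compare many more properties than a single average.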

Throughout the primer, you’ll encounter four types of highlighted information:

  • Glossary of Terms - Definitions of technical concepts and terminology
  • References - Cited research papers and additional resources
  • Credits - Contributors, acknowledgments, and version history