
7. LLM-Driven Data Synthesis

Large Language Models (LLMs) have emerged as powerful tools for synthetic data generation (SDG), offering unprecedented flexibility and contextual understanding. This chapter explores how to leverage LLMs effectively for creating high-quality synthetic datasets across various domains and use cases. The widespread use, accessibility, and practical usefulness of LLMs warrant this dedicated chapter.

Synthetic Data Overview

Figure 1: Meticulous prompt engineering to generate diverse textbook samples, leveraging the diversity of audiences (from left to right: young children, professionals, researchers, and high school students) and styles.


Image reference:

Hugging Face’s Cosmopedia blog

LLMs exhibit a range of capabilities that make them especially powerful for SDG:

  • Broad knowledge representation and pattern combination: They internalize complex probability distributions and dependencies across data and therefore can generate coherent and novel combinations by understanding how elements interact.
  • In-context learning: They adapt to new tasks with just a few examples provided in the prompt.
  • Controllability through prompting: Users can guide generation through instructions, constraints, and formatting specifications, providing fine-grained control over synthetic data characteristics.
  • Emergent reasoning: At scale, LLMs gain unexpected abilities such as multi-step problem solving and abstract reasoning, enabling them to generalize, infer, and reason beyond surface patterns.
  • Multimodal inputs and structured outputs: Modern models handle cross-modal inputs (text, images, audio) and generate structured outputs (tables, code, JSON).

Figure 2: Key methodologies and considerations covered in this chapter, organized into four main categories: Generation techniques (from basic prompting to fine-tuning), Curation and Filtering methods for quality control, Evaluation metrics for assessing synthetic data quality, and Risk Assessment for identifying potential safety concerns. The arrows indicate the typical workflow.

LLM-driven SDG follows two primary approaches: prompt-based generation and fine-tuning, each offering different advantages for different use cases and resource constraints1.

The approaches introduced in this chapter are illustrative, not exhaustive, as the field is evolving rapidly.

Prompt engineering forms the foundation of LLM-driven SDG, enabling practitioners to guide models toward desired outputs through carefully crafted instructions and examples.

Zero-shot prompting uses the model’s pre-trained knowledge without providing examples; it is suitable for generating common data types where the model’s training provides sufficient context.

Example:

Generate 10 synthetic citizen feedback entries for Singapore's
HealthHub digital services. Include variety in sentiment (positive,
neutral, negative) and feedback length. Make them realistic but
not based on real citizens.
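A zero-shot prompt like the one above can be assembled programmatically so that the count, service, and sentiment mix vary across generation runs. This is a minimal sketch; the template wording and function name are illustrative, and the resulting string would be sent to whichever LLM API you use.

```python
# Sketch: parametrized zero-shot prompt construction.
# Template text mirrors the example above; names are illustrative.

ZERO_SHOT_TEMPLATE = (
    "Generate {n} synthetic citizen feedback entries for {service}. "
    "Include variety in sentiment ({sentiments}) and feedback length. "
    "Make them realistic but not based on real citizens."
)

def build_zero_shot_prompt(n, service,
                           sentiments=("positive", "neutral", "negative")):
    """Fill the template; the returned string is what the LLM receives."""
    return ZERO_SHOT_TEMPLATE.format(
        n=n, service=service, sentiments=", ".join(sentiments))

prompt = build_zero_shot_prompt(10, "Singapore's HealthHub digital services")
```

Keeping the template separate from the parameters makes it easy to sweep over services and sentiment mixes when building a larger dataset.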

Chain-of-thought prompting guides the model through explicit reasoning steps before it generates the final output, which is particularly valuable for creating logically consistent data.

Example:

Generate synthetic policy recommendation documents for Singapore's Smart Nation initiative.
Think step by step:
1. First, identify a specific Smart Nation domain (transport, healthcare, urban planning, etc.)
2. Consider the current challenges in that domain
3. Propose evidence-based solutions
4. Structure as formal policy recommendation with rationale
Generate 3 policy recommendations following this reasoning process.

Sample Step-by-Step Reasoning:

Policy Recommendation 1 - Urban Transport:
Domain: Intelligent Transportation Systems
Challenge: Peak hour congestion in CBD areas
Evidence: Traffic data shows 35% increase during 8-9 AM
Solution: Dynamic road pricing with AI-optimized rates
Rationale: Reduces congestion while generating revenue for infrastructure...
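The step list in a chain-of-thought prompt can likewise be built from data, so the same scaffold is reused across tasks. A minimal sketch, assuming a simple newline-joined format; the helper name and step wording (taken from the example above) are illustrative.

```python
# Sketch: assembling a chain-of-thought prompt from explicit steps.

def build_cot_prompt(task, steps, closing):
    """Join a task, numbered reasoning steps, and a closing instruction."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
    return f"{task}\nThink step by step:\n{numbered}\n{closing}"

steps = [
    "First, identify a specific Smart Nation domain (transport, healthcare, urban planning, etc.)",
    "Consider the current challenges in that domain",
    "Propose evidence-based solutions",
    "Structure as formal policy recommendation with rationale",
]
prompt = build_cot_prompt(
    "Generate synthetic policy recommendation documents for Singapore's Smart Nation initiative.",
    steps,
    "Generate 3 policy recommendations following this reasoning process.",
)
```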

Knowledge-augmented generation enhances LLM outputs by incorporating external information into the prompt, improving factual accuracy and relevance. This can be achieved through external knowledge sources such as RAG (Retrieval-Augmented Generation), knowledge graphs2, or retrieval from websites.

Example (RAG): retrieve relevant information → include it in the prompt context → generate data grounded in the retrieved facts.

Task: Generate synthetic policy Q&A for housing policies
Step 1 - Retrieve: Get current HDB eligibility criteria from official sources
Step 2 - Context: Include retrieved facts in prompt
Step 3 - Generate: Create Q&A pairs based on verified information
"Question: What are the income requirements for first-time BTO applicants?
Answer: Based on current HDB guidelines, the monthly household income
ceiling is $14,000 for non-mature estates and $7,000 for mature estates..."
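The retrieve → contextualize → generate pattern can be sketched end to end. This toy version ranks documents by naive keyword overlap; a real system would use a vector store and embedding similarity, and the corpus, queries, and helper names here are all illustrative placeholders rather than real HDB data.

```python
# Sketch: minimal retrieve -> contextualize -> generate pipeline.
# Retrieval is naive word overlap; real systems use embeddings.

def retrieve(query, corpus, k=1):
    """Rank documents by shared words with the query; return top k."""
    q = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_grounded_prompt(task, query, corpus):
    """Place retrieved facts ahead of the generation task."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context (verified sources):\n{context}\n\nTask: {task}"

corpus = [
    "HDB eligibility: a monthly household income ceiling applies to BTO applicants.",
    "Smart Nation covers transport, healthcare and urban planning.",
]
prompt = build_grounded_prompt(
    "Generate Q&A pairs based only on the context above.",
    "income requirements for first-time BTO applicants",
    corpus,
)
```

Grounding the prompt this way constrains the model to verified facts, which is the main lever against hallucinated policy details.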

While prompt engineering is flexible, fine-tuning might be required when one needs specialized behavior, consistent output formatting, or optimization for specific organizational use cases.

  • Full Fine-Tuning: Update all LLM parameters on task-specific datasets for maximum customization.

  • Parameter-Efficient Fine-Tuning (PEFT): Update only a small subset of parameters, keeping most of the pretrained model frozen. This reduces compute cost and preserves general knowledge.

  • Instruction Tuning: Fine-tune on diverse instruction–response datasets to improve general task-following and formatting.
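The parameter-efficiency argument behind PEFT methods such as LoRA can be made concrete with a small calculation: a frozen weight matrix W is adapted as W + A @ B, where only the low-rank factors A and B are trained. A sketch assuming illustrative dimensions, not a real training loop.

```python
# Sketch: the LoRA-style low-rank adapter idea behind PEFT.
import numpy as np

d, r = 1024, 8                        # hidden size, adapter rank (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))           # frozen pretrained weights
A = rng.normal(size=(d, r)) * 0.01    # trainable low-rank factor
B = np.zeros((r, d))                  # zero-init so the adapter starts as a no-op

W_eff = W + A @ B                     # effective weight used at inference

frozen = W.size                       # 1,048,576 parameters stay frozen
trainable = A.size + B.size           # only 16,384 parameters are trained
print(f"trainable fraction: {trainable / frozen:.4f}")
```

With rank 8 on a 1024-wide layer, the adapter trains about 1.6% of the layer's parameters, which is why PEFT cuts compute cost while leaving the pretrained knowledge in W untouched.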

LLM-generated data needs systematic curation to mitigate hallucinations, inconsistencies, and quality variations. Effective curation ensures synthetic data aligns with downstream requirements through two main strategies:

Sample Filtering: identify and handle low-quality outputs.

  • Heuristic checks: Rule-based validation, confidence thresholds, or domain-specific quality signals.
  • LLM-as-a-judge: A separate LLM evaluates outputs against explicit criteria, sometimes in ensembles for reliability.
  • Re-weighting: Instead of binary accept/reject, assign weights to emphasize more useful samples while retaining diversity.
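The heuristic-check and re-weighting strategies above can be combined: score each sample with simple rules, drop clear failures, and keep a soft weight for the rest. The scoring rules and threshold below are illustrative placeholders for whatever domain-specific signals apply.

```python
# Sketch: heuristic scoring with soft re-weighting instead of a
# pure accept/reject gate. Rules and threshold are illustrative.

def heuristic_score(sample):
    """Crude quality signal: penalize very short or all-caps outputs."""
    score = 1.0
    if len(sample.split()) < 4:   # too short to be a realistic entry
        score -= 0.5
    if sample.isupper():          # shouting-style artifacts
        score -= 0.5
    return max(score, 0.0)

def curate(samples, min_score=0.6):
    """Return (sample, weight) pairs, dropping clearly bad outputs."""
    scored = [(s, heuristic_score(s)) for s in samples]
    return [(s, w) for s, w in scored if w >= min_score]

samples = ["The clinic wait time was reasonable today.", "BAD!!!", "ok"]
kept = curate(samples)
```

Retained weights can then scale each sample's contribution to the training loss, emphasizing useful samples without discarding diversity outright.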

Label Enhancement: improve or standardize annotations.

  • Human review: Experts validate data in critical domains (high cost but high reliability).
  • Auxiliary models: Smaller specialized models refine or correct labels at scale; active learning can route only uncertain cases to humans.
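The active-learning routing mentioned above reduces human cost by accepting confident auxiliary-model labels automatically and escalating only uncertain cases. A minimal sketch, assuming the auxiliary model exposes a confidence score; the threshold and sample data are illustrative.

```python
# Sketch: confidence-based routing of auxiliary-model labels.
# Labels above the threshold are accepted; the rest go to humans.

def route_labels(predictions, threshold=0.9):
    """predictions: list of (sample, label, confidence) tuples."""
    auto, needs_review = [], []
    for sample, label, conf in predictions:
        (auto if conf >= threshold else needs_review).append((sample, label))
    return auto, needs_review

preds = [
    ("Feedback A", "positive", 0.97),
    ("Feedback B", "negative", 0.55),   # uncertain -> human review
    ("Feedback C", "neutral", 0.92),
]
auto, needs_review = route_labels(preds)
```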

LLM-generated synthetic data requires adapted evaluation and risk-management approaches that extend beyond those used for other SDG methods.

The dimensions introduced in the Quality Evaluation chapter still apply, but in the LLM context they take on additional nuance:

  • Factualness: Verifies that generated content is accurate and internally consistent. Methods include cross-referencing with authoritative sources, domain expert validation, and entailment-based consistency tests (using natural language inference models to check if one statement logically follows from another). For example, if the model outputs “Singapore is in Asia,” this is consistent with “Asia contains Singapore,” while “Singapore is in Europe” would be flagged as a contradiction.

  • Relevance: Ensures the data supports the intended application. Methods include task-specific performance testing, alignment scoring, and end-user validation. For example, synthetic fraud data should improve a fraud detection model’s precision, not just look realistic in isolation.

  • Diversity: Confirms broad coverage across scenarios, demographics, and edge cases. Techniques include statistical sampling analysis, n-gram repetition checks, and demographic audits. For example, a health dataset should not only cover common conditions like flu but also rare cases like epilepsy, ensuring minority representation is preserved.

  • Privacy: Guards against memorization or leakage of sensitive training data, including the risk of generating highly realistic but fabricated individual records. Approaches include membership inference tests, leakage detection tools, manual red-teaming, and compliance checks. For example, if an LLM regurgitates a real patient’s name and diagnosis that was present in its training data, privacy safeguards should flag and block that output.
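Of the dimensions above, diversity is the easiest to check automatically. One common lightweight signal is the distinct-n ratio: unique n-grams divided by total n-grams across the generated corpus, where low values indicate repetitive outputs. A minimal sketch with illustrative sample texts.

```python
# Sketch: distinct-n ratio as a cheap diversity signal for
# generated text. Low values flag repetitive output.

def distinct_n(texts, n=2):
    """Fraction of n-grams across all texts that are unique."""
    grams = []
    for t in texts:
        toks = t.lower().split()
        grams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0

repetitive = ["the service was good", "the service was good"]
varied = ["the service was good", "long queue at the clinic"]
print(distinct_n(repetitive), distinct_n(varied))  # lower vs. higher diversity
```

In practice this is paired with demographic audits and coverage checks, since n-gram variety alone does not guarantee scenario or population diversity.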

LLM-driven generation introduces specific risks that require proactive controls:

  • Hallucination: LLMs may produce plausible but incorrect outputs that mislead analysis or training. Mitigations include RAG with trusted sources, expert review, structured review workflows, and human oversight in sensitive domains.

  • Bias Amplification and Content Issues: LLMs may reinforce societal biases or generate inappropriate outputs, including offensive or copyrighted content. Mitigations include demographic bias audits, fairness constraints, balanced prompts, content filtering, and evaluation by diverse teams.

  • Prompt Injection and Manipulation: Malicious or adversarial prompts can subvert controls and trigger unsafe or low-quality outputs. Mitigations include validated prompt templates, input sanitization, content filters, and regular adversarial testing.

  • Model Collapse: Repeated training on synthetic outputs can degrade quality and reduce diversity as models “forget” reality. Mitigations include mixing synthetic with real data, applying quality-control filters, and verifying outputs to prevent compounding errors.
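The model-collapse dynamic can be illustrated with a toy simulation: repeatedly fit a Gaussian to a finite sample drawn from the previous generation's fit, standing in for "training on your own outputs." The sample sizes and generation count are illustrative; the shrinking spread mirrors the loss of diversity described above.

```python
# Toy simulation of model collapse: each "generation" is fit only to
# samples from the previous generation's model, and the fitted spread
# tends to shrink. Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 0.0, 1.0                       # the original "real" distribution
stds = []
for generation in range(100):
    data = rng.normal(mu, sigma, size=20)  # train only on model output
    mu, sigma = data.mean(), data.std()    # refit and replace "reality"
    stds.append(sigma)

print(f"std at gen 1: {stds[0]:.3f}, std at gen 100: {stds[-1]:.3f}")
```

Mixing real data back in at each generation (rather than fitting purely to synthetic samples) counteracts this drift, which is why the mitigation above emphasizes blending synthetic with real data.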

Model Collapse Illustration

Figure 4: Model Collapse occurs when AI models are trained repeatedly on their own outputs, gradually losing fidelity and diversity—like making a copy of a copy. The sequence (right) shows a diffusion model retrained on its own images: faces become increasingly warped and lose detail with each iteration.


Image reference (left):

The Curse of Recursion: Training on Generated Data Makes Models Forget research paper


(right):

Nepotistically Trained Generative Image Models Collapse research paper

Understanding these evaluation criteria and limitations enables responsible deployment of LLM-generated data, ensuring benefits are balanced against potential risks.


The next chapter covers real-world deployments across diverse sectors and use cases.

  1. While resource constraints are often cited as bottlenecks in SDG, this post by Anyscale’s co-founder argues that compute resources can help us achieve high-quality synthetic data by enabling advanced approaches like reasoning, multi-step thinking, task decomposition, and agentic workflows.

  2. Microsoft Research introduced the SYNTHLLM framework for SDG (2025), a novel approach to SDG that constructs and leverages a “global concept graph” of high-level topics and key concepts within a domain (in this work, the authors focused on mathematics). It enables systematic sampling and recombination of diverse concepts across multiple documents, providing “knowledge-infused” prompts to the LLM and leading to the generation of unique, diverse, and complex questions.