9. Applications in the Public Sector

As we noted in the previous chapter on Real World Deployments, government agencies and enterprises are adopting synthetic data as a Privacy-Enhancing Technology (PET) and AI accelerator to address data bottlenecks while supporting responsible AI development, research, and innovation.

This chapter presents practical applications drawn from our public sector experience. We organize these into broad categories that span diverse use cases—from operational testing to AI development and secure data sharing—with insights applicable to both government agencies and enterprises worldwide.

Figure: A survey by GovTech’s data privacy team showed interest in synthetic data, and a range of potential benefits, across diverse Singapore government agencies.

1. Safe and Scalable AI Development

AI powers a growing number of government applications, including innovation projects such as Named Entity Recognition (NER) for PII detection in sensitive documents, LLM-powered assistance systems, and multimodal AI-assisted cataloguing. Synthetic data is proving valuable for training, testing, and evaluating AI/ML models, particularly where real datasets are scarce, do not yet exist, or are too sensitive to access.

Example Applications

Dataset Augmentation: Enhance training datasets for classification and regression tasks where real government data is limited, enabling more accurate AI-driven policy analysis and evidence-based decision-making.
Representation Enhancement: Generate records for vulnerable or under-represented groups to improve model fairness and inclusivity across diverse citizen demographics.
Data Imputation: Fill missing or erroneous values with synthetic ones. Hardware readings often contain gaps—synthetic data can help complete these with statistically representative values, such as filling missing utility-meter data.
Scenario and System Testing: Generate datasets to test AI models against hypothetical situations, such as computer vision systems detecting unusual pedestrian behavior, or to validate Retrieval-Augmented Generation (RAG) [^1] systems with diverse synthetic queries before production deployment.
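
The data imputation item above can be illustrated with a minimal sketch. It assumes hourly utility-meter readings stored as `(hour_of_day, value)` pairs with `None` marking a gap, and fills each gap by sampling from the values observed at the same hour. The function name, the data layout, and the sampling rule are all hypothetical choices made for illustration, not a description of any agency's method.

```python
import random
from collections import defaultdict


def impute_by_hour(readings, seed=0):
    """Fill None gaps by sampling observed values from the same hour of day.

    readings: list of (hour_of_day, value_or_None) tuples (assumed layout).
    Returns a new list of the same length with no None values.
    """
    rng = random.Random(seed)  # seeded so the result is reproducible

    # First pass: collect the observed values, grouped by hour of day.
    by_hour = defaultdict(list)
    all_observed = []
    for hour, value in readings:
        if value is not None:
            by_hour[hour].append(value)
            all_observed.append(value)
    if not all_observed:
        raise ValueError("no observed readings to sample from")

    # Second pass: replace each gap with a draw from the matching hour,
    # falling back to all observed values when that hour has none.
    filled = []
    for hour, value in readings:
        if value is None:
            pool = by_hour[hour] or all_observed
            value = rng.choice(pool)
        filled.append((hour, value))
    return filled


readings = [(0, 1.2), (1, 0.9), (0, None), (1, None), (2, None), (0, 1.4)]
filled = impute_by_hour(readings)
```

Sampling from observed values keeps every imputed reading within the range actually seen at that hour. A production approach would usually also model trends and seasonality, which this sketch ignores.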

Student learning enhancement
: Generated student–chatbot conversations across diverse personas (e.g., high-performing students, weak English speakers, off-topic students, counterfactual students) to evaluate the Ministry of Education (MOE)'s SLS Learning Assistant before deployment, when real interaction data was unavailable. Faithfulness and factuality testing showed 99.3% response accuracy.
Guardrails training for safety and compliance
: Created over two million prompt pairs to train off-topic detection guardrails for LLM applications in pre-production, using high-temperature generation and structured outputs to simulate diverse misuse scenarios and edge cases.
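
As a rough illustration of persona-based test generation, the sketch below builds one generation prompt per persona and topic pair. The four persona labels come from the example above; the persona descriptions, the topics, and the prompt wording are assumptions made for illustration. The step that would send each prompt to an LLM is deliberately left out.

```python
from itertools import product

# Persona labels follow the text; the descriptions are illustrative guesses.
PERSONAS = {
    "high_performing": "a high-performing student who asks probing follow-up questions",
    "weak_english": "a student with weak English who writes short, ungrammatical messages",
    "off_topic": "a student who drifts to topics unrelated to the lesson",
    "counterfactual": "a student who asserts incorrect facts and asks for confirmation",
}
TOPICS = ["photosynthesis", "fractions"]  # hypothetical lesson topics


def build_generation_prompts(personas=PERSONAS, topics=TOPICS, turns=4):
    """Return one prompt record for every (persona, topic) combination."""
    prompts = []
    for (key, description), topic in product(personas.items(), topics):
        prompts.append({
            "persona": key,
            "topic": topic,
            "prompt": (
                f"Role-play {description}. Write a {turns}-turn conversation "
                f"with a learning assistant about {topic}."
            ),
        })
    return prompts


prompts = build_generation_prompts()
```

Enumerating the full grid makes coverage explicit: every persona is exercised on every topic, so a gap in the evaluation set shows up as a missing combination rather than going unnoticed.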

2. Secure Data Sharing

Data sharing is essential for innovation in Smart Nations like Singapore, where agencies frequently need to share data both internally (between agencies, departments, and teams) and externally (with private firms, international partners, and university researchers). Synthetic data enables sharing when real data is restricted by privacy concerns, security classifications (e.g., data classification levels, NDAs), or regulatory requirements (e.g., public sector privacy guidelines, internal agency-level policies).

Example Applications

Research Partnerships: Institutes of Higher Learning (IHLs) and academic institutions can access synthetic versions of sensitive citizen datasets for research.¹
Inter-Agency Collaboration: Government agencies can share data when privacy risks, security classifications, or legal agreements prevent direct access to real datasets.
Third-Party Integration: External vendors and contractors can develop and test systems using synthetic data instead of accessing sensitive government records.

Note: For privacy-preserving data sharing, synthetic data requires effective privacy safeguards and evaluation, especially in high-risk scenarios such as sharing with external vendors. See Privacy-preserving Synthesis for detailed guidance.

3. Operational Efficiency: Testing, Exploration and Innovation

Application testing is perhaps the most common public sector use case for synthetic data that we have encountered so far. It extends beyond AI-related testing to include scalability testing, software development, and system validation. Synthetic data enables agencies to test systems and develop tools without relying on sensitive or production data, reducing security risks and operational complexity. It also supports secure exploratory work such as prototyping, hackathons, and early validation, especially when access to real data is restricted or slow to obtain.

Example Applications

Security Testing: Use synthetic code and log data to evaluate security tools while enabling diverse testing scenarios, augmenting limited datasets, and protecting sensitive operational data.
Scalability Testing: Generate large-scale synthetic clinical text to test PII detection models during development, ensuring coverage of rare and edge-case scenarios before production deployment.
Rapid Prototyping: Accelerate early-stage model development and concept validation with immediate access to realistic datasets, reducing dependency on lengthy data access approval processes.
Innovation Events: Conduct hackathons and innovation sprints using synthetic datasets to avoid exposing sensitive government information while maintaining realistic data characteristics.
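
To make the testing use case concrete, the sketch below generates synthetic clinical-style notes with known PII labels and measures the recall of a simple regex detector against them. The names, the `ID-` reference format, the sentence templates, and the detector are all invented for illustration. The detector is a stand-in for whatever model is actually under test.

```python
import random
import re

NAMES = ["Alex Tan", "Priya Nair", "Sam Lee"]  # made-up names
TEMPLATES = [
    "Patient {name} (ref {ref}) can be reached at {email}.",
    "Follow-up for {name}; contact {email}; case {ref}.",
]


def make_records(n, seed=0):
    """Generate n synthetic notes, each with its ground-truth PII strings."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        name = rng.choice(NAMES)
        ref = f"ID-{rng.randrange(10**6):06d}"  # hypothetical ID format
        email = name.lower().replace(" ", ".") + "@example.com"
        text = rng.choice(TEMPLATES).format(name=name, ref=ref, email=email)
        records.append({"text": text, "pii": {name, ref, email}})
    return records


# Baseline detector: matches the reference IDs and email-like strings only.
DETECTOR = re.compile(r"ID-\d{6}|[\w.]+@[\w.]+\w")


def recall(records):
    """Fraction of labeled PII strings that the detector finds."""
    found = total = 0
    for rec in records:
        hits = set(DETECTOR.findall(rec["text"]))
        found += len(rec["pii"] & hits)
        total += len(rec["pii"])
    return found / total


records = make_records(1000)
score = recall(records)
```

Because every record carries its own ground truth, the same generator scales from a handful of notes to millions without any manual labeling. The baseline here cannot match names, so its recall stays below 1, which is the kind of gap this style of testing is meant to surface before deployment.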

These applications demonstrate how synthetic data addresses critical government challenges. As agencies increasingly recognize the potential, successful implementation requires careful planning, appropriate privacy measures, and alignment with organizational data governance frameworks.

The next chapter sheds light on the challenges and risks. Understanding these challenges is essential for encouraging responsible and sustainable adoption in environments where public trust and operational reliability are critical.

¹ The UK’s National Health Service (NHS) operates an “artificial data” service that provides researchers with access to over 200 healthcare datasets in artificial form. This service helps users understand data structure, fields, and approximate value ranges before applying for access to real patient data. Notably, this system uses a purely statistical approach (classical statistical distributions, frequency sampling, and percentile-based generation) rather than machine learning algorithms. This design choice prioritizes transparency, regulatory compliance, and privacy, which are essential for healthcare applications, though it means the artificial data cannot preserve complex relationships between fields.
