Simulation-Based Synthetic Data Generation

Description

Generates synthetic datasets through computational simulation of underlying data-generating processes, encompassing statistical methods (copula models, parametric distribution fitting), agent-based models, physics-informed simulators, and Monte Carlo sampling. Unlike neural-network-based approaches, these methods encode explicit domain knowledge or statistical structure into the generation process — copulas model multivariate dependencies through known distributional families, agent-based simulations construct data from interacting rule-driven entities, and physics-informed generators embed differential equation constraints. This makes the synthetic data more interpretable and auditable, with known theoretical properties, at the cost of reduced flexibility for capturing complex nonlinear patterns that lack a known generative model.
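The copula mechanism described above can be sketched in a few lines: sample from a multivariate normal to fix the dependence structure, map to uniforms, then push the uniforms through the inverse CDFs of whatever marginals the domain calls for. The correlation value and the exponential/Pareto marginals below are illustrative assumptions, not taken from any real dataset.

```python
import numpy as np
from math import erf

# Minimal Gaussian-copula sketch (illustrative parameters): the copula
# encodes only the dependence structure; the marginals are chosen
# separately and applied via the probability integral transform.
rng = np.random.default_rng(42)

corr = np.array([[1.0, 0.7],
                 [0.7, 1.0]])              # dependence structure only
L = np.linalg.cholesky(corr)

n = 10_000
z = rng.standard_normal((n, 2)) @ L.T      # correlated standard normals
u = 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))  # Phi(z): uniform marginals

# Analytic inverse CDFs for the chosen (assumed) marginals:
income = -20_000 * np.log1p(-u[:, 0])      # Exponential, mean 20,000
loss = (1.0 - u[:, 1]) ** (-1.0 / 3.0)     # Pareto, x_m = 1, alpha = 3
synthetic = np.column_stack([income, loss])
```

Because the dependence and the marginals are specified separately, each part can be audited on its own, which is the interpretability advantage the description refers to.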

Example Use Cases

Safety

Using copula-based simulation to generate synthetic financial portfolio data preserving tail dependencies and correlation structures, enabling stress testing of risk models under extreme market conditions without relying solely on limited historical crash data.

Running agent-based traffic simulations to generate synthetic autonomous vehicle sensor data covering rare and dangerous driving scenarios (pedestrian occlusion, multi-vehicle pile-ups) that are impractical or unethical to collect in real-world testing.
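A toy version of the agent-based idea, with assumed rules rather than any real AV simulator: vehicles on a single-lane ring road follow a simple car-following rule, and logging every step produces a synthetic trajectory table, including hard-braking interactions that would be dangerous to stage on a real road.

```python
import numpy as np

# Toy agent-based traffic sketch (all rules and parameters are assumptions):
# each car accelerates toward a target speed when the gap ahead is large,
# and brakes to fit the available gap otherwise.
rng = np.random.default_rng(7)

n_cars, road_len, steps = 20, 200.0, 300
pos = np.sort(rng.uniform(0, road_len, n_cars))
vel = np.full(n_cars, 1.0)

log = []  # (step, car_id, position, speed) records
for t in range(steps):
    gap = (np.roll(pos, -1) - pos) % road_len        # distance to car ahead
    vel = np.where(gap > 5.0,
                   np.minimum(vel + 0.1, 2.0),       # accelerate toward 2.0
                   np.minimum(vel, np.maximum(gap - 1.0, 0.0)))  # brake
    vel = np.maximum(vel + rng.normal(0, 0.05, n_cars), 0.0)     # driver noise
    pos = (pos + vel) % road_len
    log.extend((t, i, pos[i], vel[i]) for i in range(n_cars))

data = np.array(log)  # synthetic trajectory dataset: step, id, pos, speed
```

The data comes entirely from the interaction rules, so rare events (sudden braking cascades) emerge from the simulation rather than needing to be observed in the field.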

Reliability

Employing Monte Carlo simulation to generate synthetic clinical trial outcome data for sensitivity analysis of treatment-effect estimators, quantifying how model predictions change across plausible data-generating processes.
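A hedged sketch of this kind of sensitivity analysis, with a difference-in-means estimator standing in for the treatment-effect estimator and all parameter values invented for illustration: simulate trials under several assumed data-generating processes (here, with and without confounding) and record how the estimate degrades.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_trial(n, true_effect, noise_sd, confounding):
    """One synthetic trial under an assumed data-generating process."""
    severity = rng.normal(0, 1, n)            # latent prognostic factor
    # confounding > 0 means sicker patients are more likely to be treated
    p_treat = 1 / (1 + np.exp(-confounding * severity))
    treated = rng.uniform(size=n) < p_treat
    outcome = (true_effect * treated - 0.5 * severity
               + rng.normal(0, noise_sd, n))
    # Naive difference-in-means estimator of the treatment effect
    return outcome[treated].mean() - outcome[~treated].mean()

true_effect = 1.0
for confounding in (0.0, 1.0):
    estimates = [simulate_trial(500, true_effect, 1.0, confounding)
                 for _ in range(2000)]
    bias = np.mean(estimates) - true_effect
    print(f"confounding={confounding}: bias={bias:+.2f}, "
          f"sd={np.std(estimates):.2f}")
```

Re-running the loop across a grid of plausible data-generating processes quantifies exactly the kind of sensitivity the use case describes: the estimator is unbiased under randomised assignment but picks up a systematic bias once assignment depends on severity.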

Privacy

Fitting Gaussian copula models to census microdata to generate synthetic population datasets that preserve multivariate demographic relationships whilst providing formal disclosure limitation through the generation process.

Limitations

  • Requires explicit specification of the data-generating process, meaning that important but unknown relationships or latent factors not captured in the simulation model will be absent from the synthetic data.
  • Copula models assume that dependencies can be separated from marginal distributions, which may not hold for complex real-world datasets where the dependency structure itself varies across the marginal distribution.
  • Agent-based and physics-informed simulators require significant domain expertise to design, calibrate, and validate, making them resource-intensive compared to data-driven approaches.
  • Scalability is limited for high-dimensional problems — parametric models face the curse of dimensionality in specifying joint distributions, and agent-based simulations become computationally expensive as the number of interacting entities grows.

Resources

Research Papers

Copula Flows for Synthetic Data Generation
Sanket Kamthe, Samuel Assefa, and Marc Deisenroth
Jan 3, 2021

The ability to generate high-fidelity synthetic data is crucial when available (real) data is limited or where privacy and data protection standards allow only for limited use of the given data, e.g., in medical and financial data-sets. Current state-of-the-art methods for synthetic data generation are based on generative models, such as Generative Adversarial Networks (GANs). Even though GANs have achieved remarkable results in synthetic data generation, they are often challenging to interpret. Furthermore, GAN-based methods can suffer when used with mixed real and categorical variables. Moreover, loss function (discriminator loss) design itself is problem specific, i.e., the generative model may not be useful for tasks it was not explicitly trained for. In this paper, we propose to use a probabilistic model as a synthetic data generator. Learning the probabilistic model for the data is equivalent to estimating the density of the data. Based on the copula theory, we divide the density estimation task into two parts, i.e., estimating univariate marginals and estimating the multivariate copula density over the univariate marginals. We use normalising flows to learn both the copula density and univariate marginals. We benchmark our method on both simulated and real data-sets in terms of density estimation as well as the ability to generate high-fidelity synthetic data.

synthpop: Bespoke Creation of Synthetic Data in R
Beata Nowok, Gillian M. Raab, and Chris Dibben
Oct 28, 2016

In many contexts, confidentiality constraints severely restrict access to unique and valuable microdata. Synthetic data which mimic the original observed data and preserve the relationships between variables but do not contain any disclosive records are one possible solution to this problem. The synthpop package for R, introduced in this paper, provides routines to generate synthetic versions of original data sets. We describe the methodology and its consequences for the data characteristics. We illustrate the package features using a survey data example.

The Synthetic Data Vault
Neha Patki, Roy Wedge, and Kalyan Veeramachaneni
Oct 1, 2016

The goal of this paper is to build a system that automatically creates synthetic data to enable data science endeavors. To achieve this, we present the Synthetic Data Vault (SDV), a system that builds generative models of relational databases. We are able to sample from the model and create synthetic data, hence the name SDV. When implementing the SDV, we also developed an algorithm that computes statistics at the intersection of related database tables. We then used a state-of-the-art multivariate modeling approach to model this data. The SDV iterates through all possible relations, ultimately creating a model for the entire database. Once this model is computed, the same relational information allows the SDV to synthesize data by sampling from any part of the database. After building the SDV, we used it to generate synthetic data for five different publicly available datasets. We then published these datasets, and asked data scientists to develop predictive models for them as part of a crowdsourced experiment. By analyzing the outcomes, we show that synthetic data can successfully replace original data for data science. Our analysis indicates that there is no significant difference in the work produced by data scientists who used synthetic data as opposed to real data. We conclude that the SDV is a viable solution for synthetic data generation.

An evaluation framework for synthetic data generation models
Ioannis E. Livieris et al.
Apr 13, 2024

Nowadays, the use of synthetic data has gained popularity as a cost-efficient strategy for enhancing data augmentation, improving machine learning model performance, and addressing concerns related to sensitive data privacy. Ensuring the quality of generated synthetic data, in terms of accurately representing the real data, is therefore of primary importance. In this work, we present a new framework for evaluating synthetic data generation models' ability to develop high-quality synthetic data. The proposed approach provides strong statistical and theoretical information about the evaluation framework and the compared models' ranking. Two use case scenarios demonstrate the applicability of the proposed framework for evaluating the ability of synthetic data generation models to generate high-quality data. The implementation code can be found at https://github.com/novelcore/synthetic_data_evaluation_framework.

Software Packages

synthpop: Generating Synthetic Versions of Sensitive Microdata for Statistical Disclosure Control
Jul 12, 2025

A tool for producing synthetic versions of microdata containing confidential information so that they are safe to be released to users for exploratory analysis. The key objective of generating synthetic data is to replace sensitive original values with synthetic ones causing minimal distortion of the statistical information contained in the data set. Variables, which can be categorical or continuous, are synthesised one-by-one using sequential modelling. Replacements are generated by drawing from conditional distributions fitted to the original data using parametric or classification and regression trees models. Data are synthesised via the function syn() which can be largely automated, if default settings are used, or with methods defined by the user. Optional parameters can be used to influence the disclosure risk and the analytical quality of the synthesised data. For a description of the implemented method see Nowok, Raab and Dibben (2016) <doi:10.18637/jss.v074.i11>. Functions to assess identity and attribute disclosure for the original and for the synthetic data are included in the package, and their use is illustrated in a vignette on disclosure (Practical Privacy Metrics for Synthetic Data).

SDV
May 11, 2018

Synthetic data generation for tabular data

sdv-dev/Copulas
Feb 23, 2026

A library to model multivariate data using copulas.

Tutorials

Generating Synthetic Multivariate Data with Copulas
Zachary Warnes
Sep 3, 2021

Create more data with the same feature dependencies as your data
