GAN-Based Tabular Synthetic Data

Description

Generates synthetic tabular datasets using Generative Adversarial Networks, most commonly through CTGAN (Conditional Tabular GAN). TVAE (Tabular Variational Autoencoder), a non-adversarial model introduced in the same paper, is often used alongside it as a baseline. CTGAN learns the joint distribution of mixed-type columns (continuous and discrete, including categorical) by training a generator and discriminator in an adversarial framework, with mode-specific normalisation to handle multimodal continuous distributions and training-by-sampling to address class imbalance in discrete columns. The resulting synthetic tables aim to preserve the statistical relationships, correlations, and marginal distributions of the original data whilst consisting of generated rather than real records, supporting privacy-preserving data sharing, model development on sensitive datasets, and augmentation of limited training data.
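
To make mode-specific normalisation concrete, the sketch below encodes one continuous value as a scalar offset within a Gaussian mode plus a one-hot indicator of that mode, using the 4-standard-deviation scaling described in the CTGAN paper. It is a minimal illustration and not the library's implementation: the mixture parameters in MODES are hypothetical and fixed by hand (CTGAN estimates them with a variational Gaussian mixture), and the function names are invented for this example.

```python
import math

# Hypothetical, already-fitted mixture for one continuous column:
# (weight, mean, std) per mode. These values are assumptions for illustration.
MODES = [(0.6, 10.0, 2.0), (0.4, 50.0, 5.0)]


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mode_specific_normalise(x, modes=MODES):
    """Return (alpha, beta): a within-mode offset and a one-hot mode indicator."""
    # Weighted density of x under each mode.
    densities = [w * gaussian_pdf(x, m, s) for w, m, s in modes]
    # Simplification: pick the most likely mode. CTGAN instead samples the
    # mode from these probabilities after normalising them.
    k = max(range(len(modes)), key=lambda i: densities[i])
    _, mean, std = modes[k]
    alpha = (x - mean) / (4.0 * std)
    beta = [1 if i == k else 0 for i in range(len(modes))]
    return alpha, beta


def denormalise(alpha, beta, modes=MODES):
    """Invert the encoding to recover the original value."""
    k = beta.index(1)
    _, mean, std = modes[k]
    return alpha * 4.0 * std + mean
```

Encoding each value relative to its own mode keeps the scalar in a narrow range even when the column has several well-separated peaks, which a single min-max or z-score scaling would not achieve.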

Example Use Cases

Privacy

Generating synthetic patient records from hospital databases using CTGAN to enable multi-institution collaborative research on treatment outcomes without sharing identifiable health data across organisational boundaries.

Reliability

Augmenting a limited fraud detection training set by generating synthetic fraudulent transaction records that preserve the statistical characteristics of rare fraud patterns, improving model recall on underrepresented fraud types.
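
Training-by-sampling is what gives a rare class such as fraud enough exposure during training. In CTGAN, a discrete column is first chosen at random, and the category to condition on is then drawn with probability proportional to the logarithm of its frequency. The sketch below covers only that second step for a single hypothetical label column; the function names, the toy counts, and the use of log(count + 1) as smoothing are assumptions of this example.

```python
import math
import random
from collections import Counter


def log_frequency_pmf(values):
    """Map each category to a probability proportional to log(count + 1)."""
    counts = Counter(values)
    logs = {cat: math.log(n + 1) for cat, n in counts.items()}
    total = sum(logs.values())
    return {cat: v / total for cat, v in logs.items()}


def sample_condition(pmf, rng):
    """Draw the category that the next training batch is conditioned on."""
    cats = list(pmf)
    return rng.choices(cats, weights=[pmf[c] for c in cats], k=1)[0]


# Hypothetical label column: 990 legitimate rows and 10 fraudulent ones.
labels = ["legit"] * 990 + ["fraud"] * 10
pmf = log_frequency_pmf(labels)

# Simulate which class 1000 training batches would be conditioned on.
rng = random.Random(0)
draws = Counter(sample_condition(pmf, rng) for _ in range(1000))
```

With these toy counts, fraud makes up 1% of the rows but roughly a quarter of the sampled conditions, so the generator is asked to produce the rare class far more often than uniform row sampling would allow.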

Fairness

Creating balanced synthetic datasets for credit scoring by generating additional records for underrepresented demographic groups, enabling fairness testing across protected characteristics without collecting new sensitive data.

Limitations

  • Training instability inherent to GAN architectures (mode collapse, vanishing gradients) can result in synthetic data that fails to capture the full diversity of the original distribution, particularly for rare categories or tail distributions.
  • Requires substantial computational resources and hyperparameter tuning, with training times scaling poorly for datasets with many columns or complex inter-column dependencies.
  • Privacy guarantees are not inherent: without additional mechanisms such as differential privacy, the generator can memorise training records, leaving the synthetic data open to membership inference and similar attacks.
  • Performance degrades significantly on datasets with high-cardinality categorical columns, complex temporal dependencies, or relational structures across multiple tables.
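
One simple screen for the memorisation risk noted above is the distance from each synthetic row to its closest real row. The sketch below is a minimal version that assumes all columns are numeric and already scaled to comparable ranges; the function names, the toy rows, and the threshold are hypothetical. A clean result does not amount to a privacy guarantee, it only rules out the most obvious near-copies.

```python
import math


def distance_to_closest_record(synthetic_row, real_rows):
    """Smallest Euclidean distance from one synthetic row to any real row."""
    return min(math.dist(synthetic_row, real) for real in real_rows)


def flag_near_copies(synthetic_rows, real_rows, threshold):
    """Return indices of synthetic rows lying within `threshold` of a real row."""
    return [
        i
        for i, row in enumerate(synthetic_rows)
        if distance_to_closest_record(row, real_rows) <= threshold
    ]


# Hypothetical, already-scaled numeric rows (for example age and income in [0, 1]).
real = [(0.20, 0.35), (0.55, 0.80), (0.90, 0.10)]
synthetic = [(0.21, 0.35), (0.40, 0.50), (0.90, 0.10)]

# Rows 0 and 2 sit on or next to a real record; row 1 does not.
suspicious = flag_near_copies(synthetic, real, threshold=0.02)
```

Flagged rows can be dropped or inspected before release. In practice the threshold is usually calibrated against the distances between real records themselves, for example using a held-out split.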

Resources

Research Papers

Modeling Tabular data using Conditional GAN
Lei Xu et al. · Oct 28, 2019

Modeling the probability distribution of rows in tabular data and generating realistic synthetic data is a non-trivial task. Tabular data usually contains a mix of discrete and continuous columns. Continuous columns may have multiple modes whereas discrete columns are sometimes imbalanced making the modeling difficult. Existing statistical and deep neural network models fail to properly model this type of data. We design TGAN, which uses a conditional generative adversarial network to address these challenges. To aid in a fair and thorough comparison, we design a benchmark with 7 simulated and 8 real datasets and several Bayesian network baselines. TGAN outperforms Bayesian methods on most of the real datasets whereas other deep learning methods could not.

The Synthetic Data Vault
Neha Patki, Roy Wedge, and Kalyan Veeramachaneni · Oct 1, 2016

The goal of this paper is to build a system that automatically creates synthetic data to enable data science endeavors. To achieve this, we present the Synthetic Data Vault (SDV), a system that builds generative models of relational databases. We are able to sample from the model and create synthetic data, hence the name SDV. When implementing the SDV, we also developed an algorithm that computes statistics at the intersection of related database tables. We then used a state-of-the-art multivariate modeling approach to model this data. The SDV iterates through all possible relations, ultimately creating a model for the entire database. Once this model is computed, the same relational information allows the SDV to synthesize data by sampling from any part of the database. After building the SDV, we used it to generate synthetic data for five different publicly available datasets. We then published these datasets, and asked data scientists to develop predictive models for them as part of a crowdsourced experiment. By analyzing the outcomes, we show that synthetic data can successfully replace original data for data science. Our analysis indicates that there is no significant difference in the work produced by data scientists who used synthetic data as opposed to real data. We conclude that the SDV is a viable solution for synthetic data generation.

An evaluation framework for synthetic data generation models
Ioannis E. Livieris et al. · Apr 13, 2024

Nowadays, the use of synthetic data has gained popularity as a cost-efficient strategy for enhancing data augmentation for improving machine learning models performance as well as addressing concerns related to sensitive data privacy. Therefore, the necessity of ensuring quality of generated synthetic data, in terms of accurate representation of real data, consists of primary importance. In this work, we present a new framework for evaluating synthetic data generation models' ability for developing high-quality synthetic data. The proposed approach is able to provide strong statistical and theoretical information about the evaluation framework and the compared models' ranking. Two use case scenarios demonstrate the applicability of the proposed framework for evaluating the ability of synthetic data generation models to generated high quality data. The implementation code can be found in https://github.com/novelcore/synthetic_data_evaluation_framework.

Bt-GAN: Generating Fair Synthetic Healthdata via Bias-transforming Generative Adversarial Networks
R. Ramachandranpillai et al. · Jan 1, 2024

Synthetic data generation offers a promising solution to enhance the usefulness of Electronic Healthcare Records (EHR) by generating realistic de-identified data. However, the existing literature primarily focuses on the quality of synthetic health data, neglecting the crucial aspect of fairness in downstream predictions. Consequently, models trained on synthetic EHR have faced criticism for producing biased outcomes in target tasks. These biases can arise from either spurious correlations between features or the failure of models to accurately represent sub-groups. To address these concerns, we present Bias-transforming Generative Adversarial Networks (Bt-GAN), a GAN-based synthetic data generator specifically designed for the healthcare domain. In order to tackle spurious correlations (i), we propose an information-constrained Data Generation Process (DGP) that enables the generator to learn a fair deterministic transformation based on a well-defined notion of algorithmic fairness. To overcome the challenge of capturing exact sub-group representations (ii), we incentivize the generator to preserve sub-group densities through score-based weighted sampling. This approach compels the generator to learn from underrepresented regions of the data manifold. To evaluate the effectiveness of our proposed method, we conduct extensive experiments using the Medical Information Mart for Intensive Care (MIMIC-III) database. Our results demonstrate that Bt-GAN achieves state-of-the-art accuracy while significantly improving fairness and minimizing bias amplification. Furthermore, we perform an in-depth explainability analysis to provide additional evidence supporting the validity of our study. In conclusion, our research introduces a novel and professional approach to addressing the limitations of synthetic data generation in the healthcare domain. By incorporating fairness considerations and leveraging advanced techniques such as GANs, we pave the way for more reliable and unbiased predictions in healthcare applications.

Software Packages

SDV
May 11, 2018

Synthetic data generation for tabular data

sdv-dev/CTGAN
Feb 25, 2026

Conditional GAN for generating synthetic tabular data.

Tutorials

How to Generate Real-World Synthetic Data with CTGAN | Towards ...
Miriam Santos · Apr 13, 2023
