Synthetic Data Evaluation

Description

Synthetic data evaluation assesses whether synthetic datasets protect individual privacy while maintaining statistical utility and fidelity to real data. This technique evaluates three key dimensions: privacy (through disclosure risk metrics and re-identification attack success rates), utility (by comparing statistical properties and model performance), and fidelity (measuring distributional similarity to real data). It produces evaluation reports quantifying the privacy-utility-fidelity trade-offs.
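Of the three dimensions, fidelity is often the quickest to check. A minimal sketch of a per-column fidelity test on toy data, using the two-sample Kolmogorov-Smirnov statistic as the distributional-similarity measure (the datasets and threshold-free reporting here are illustrative assumptions, not prescribed by any particular framework):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Toy stand-ins for a real and a synthetic dataset (rows x columns).
real = rng.normal(loc=0.0, scale=1.0, size=(1000, 3))
synth = rng.normal(loc=0.1, scale=1.1, size=(1000, 3))

# Fidelity: per-column two-sample Kolmogorov-Smirnov statistic
# (0 = identical marginal distributions, 1 = completely disjoint).
ks_stats = [ks_2samp(real[:, j], synth[:, j]).statistic
            for j in range(real.shape[1])]
print("Mean per-column KS statistic:", float(np.mean(ks_stats)))
```

Marginal KS statistics only capture column-wise similarity; a fuller evaluation would also compare pairwise correlations or train a discriminator to distinguish real from synthetic rows.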

Example Use Cases

Privacy

Validating synthetic patient data generated for medical research to ensure individual patients cannot be re-identified while maintaining statistical relationships needed for valid clinical studies.
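One common disclosure-risk check for this kind of use case is Distance to Closest Record (DCR): if synthetic rows sit much closer to the training records than genuinely unseen records do, the generator has likely memorized individuals. A minimal sketch on toy data (the deliberately "leaky" generator and the median summary are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
real_train = rng.normal(size=(500, 3))    # records the generator saw
real_holdout = rng.normal(size=(500, 3))  # records it never saw

# A deliberately "leaky" generator: copies training rows with tiny noise.
synth = real_train[:250] + rng.normal(scale=0.01, size=(250, 3))

nn = NearestNeighbors(n_neighbors=1).fit(real_train)

def median_dcr(points):
    # Median Distance to Closest Record against the training data.
    dists, _ = nn.kneighbors(points)
    return float(np.median(dists))

dcr_synth = median_dcr(synth)
dcr_holdout = median_dcr(real_holdout)  # baseline for unseen data
print(f"median DCR: synthetic={dcr_synth:.3f}, holdout={dcr_holdout:.3f}")
# A synthetic DCR far below the holdout baseline flags memorization risk.
```

Comparing against a holdout baseline matters: absolute distances are scale-dependent, so a low synthetic DCR is only alarming relative to how close unseen real records naturally fall.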

Reliability

Validating that machine learning models for predicting student outcomes, when trained on synthetic educational data, maintain reliable performance comparable to models trained on real student records, while enabling researchers to share datasets without FERPA violations.

Ensuring fraud detection models trained on synthetic credit card transactions maintain reliable performance comparable to models trained on sensitive real transaction data.
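The standard utility test behind both use cases is Train-on-Synthetic, Test-on-Real (TSTR): train one model on synthetic data, a baseline on real data, and compare both on a held-out real test set. A minimal sketch on toy classification data (the data generator, the covariate `shift` mimicking generator error, and the logistic-regression choice are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def make_data(n, shift=0.0):
    # Two-class toy data; `shift` crudely mimics generator error.
    X = rng.normal(size=(n, 4))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    return X + shift, y

X_real, y_real = make_data(2000)
X_synth, y_synth = make_data(2000, shift=0.05)
X_test, y_test = make_data(1000)

# Train-on-Real (TRTR) baseline vs. Train-on-Synthetic (TSTR),
# both evaluated on the same held-out real data.
trtr = LogisticRegression().fit(X_real, y_real)
tstr = LogisticRegression().fit(X_synth, y_synth)

auc_trtr = roc_auc_score(y_test, trtr.predict_proba(X_test)[:, 1])
auc_tstr = roc_auc_score(y_test, tstr.predict_proba(X_test)[:, 1])
print(f"TRTR AUC={auc_trtr:.3f}  TSTR AUC={auc_tstr:.3f}  "
      f"gap={auc_trtr - auc_tstr:.3f}")
```

A small TRTR-TSTR gap suggests the synthetic data preserved the relationships the downstream model needs; a large gap indicates lost utility even if marginal distributions look similar.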

Limitations

  • Trade-off between privacy and utility means strong privacy guarantees often significantly degrade data quality and analytical value.
  • Difficult to validate that synthetic data protects against all possible privacy attacks, especially sophisticated adversaries with auxiliary information.
  • Utility metrics may not capture subtle distributional differences that matter for specific downstream tasks or edge case analyses.
  • Synthetic data may introduce artificial patterns or miss rare but important real-world phenomena, limiting use for certain applications.
  • Requires significant domain expertise to properly validate fidelity and utility for specific use cases, as generic statistical metrics may not capture domain-specific requirements or failure modes.
  • Synthetic data may not preserve fairness properties or bias patterns from original data in predictable ways, requiring careful fairness testing when synthetic data is used to train decision-making models.

Resources

Research Papers

Privacy Mechanisms and Evaluation Metrics for Synthetic Data Generation: A Systematic Review
Pablo A. Osorio-Marulanda et al., Jan 1, 2024

The growth of data publishing, sharing, and mining mechanisms in various fields of industry and science has led to an increase in the flow of data, making it an important asset that needs to be protected and managed effectively. To this end, various mechanisms have been used, including Privacy Enhancing Technologies like Synthetic Data Generation, which aim to protect user-sensitive data and prevent its misuse. Synthetic data has been used not only to augment datasets and balance classes but also in data analysis applications that aim to provide useful insights in terms of utility while preserving the privacy of sensitive data. Still, there is a gap in the conceptual and state-of-the-art understanding of the level of privacy synthetic data generators can provide and how they affect various industries and fields. This systematic review addresses how privacy has been assessed and measured in the framework of synthetic data generation, and which metrics have been used to evaluate those mechanisms. We provide an overview of 105 recent studies in this field after a screening process and identify future open research directions. The main findings include a high prevalence of differential privacy as a privacy-preserving technique and privacy budget cost as a trade-off metric, a high percentage of GAN-based model implementations, and mainly healthcare applications. Our systematic review covers multiple privacy domains and can be understood as a general framework for privacy measurement applied in Synthetic Data Generation.

Can We Trust Synthetic Data in Medicine? A Scoping Review of Privacy and Utility Metrics
B. Kaabachi et al., Jan 1, 2023

Introduction: Sharing and re-using health-related data beyond the scope of its initial collection is essential for accelerating research and developing robust and trustworthy machine learning methods that can be translated into clinical settings. The sharing of synthetic data, artificially generated to resemble real patient data, is increasingly recognized as a promising means to enable such re-use while addressing the privacy concerns related to personal medical data. Nonetheless, no consensus exists yet on a standard approach for systematically and quantitatively evaluating the actual privacy gain and residual utility of synthetic data, de facto hindering its adoption. Objective: In this work, we present and systematize current knowledge on the evaluation of synthetic health-related data in terms of both privacy and utility. We provide insights and critical analysis into the current state of the art and propose concrete directions and steps forward for the research community. Methods: We assess and contextualize existing knowledge in the field through a scoping review and the creation of a common ontology that encompasses all the methods and metrics used to assess synthetic data. We follow the PRISMA-ScR methodology for data collection and knowledge synthesis. Results: We include 92 studies in the scoping review, analyzing and classifying them according to the proposed ontology. We found 48 different methods used to evaluate the residual statistical utility of synthetic data and 9 methods used to evaluate the residual privacy risks. Moreover, we observe that there is currently no consensus among researchers on either individual metrics or families of metrics for evaluating the privacy and utility of synthetic data. Our findings on the privacy of synthetic data show an alarming tendency to trust the safety of synthetic data without properly evaluating it.
Conclusion: Although the use of synthetic data in healthcare promises an easy and hassle-free alternative to real data, the lack of consensus on evaluation hinders the adoption of this new technology. We believe that, by raising awareness and providing a comprehensive taxonomy of evaluation methods that takes into account the current state of the literature, our work can foster the development and adoption of uniform approaches and consequently facilitate the use of synthetic data in the medical domain.

Software Packages

SynEval
Apr 18, 2024

An evaluation framework that assesses the fidelity, utility, diversity, and privacy of synthetic tabular data generated by large language models.

Tutorials

Evaluating synthetic data | Towards Data Science
Aymeric Floyrac, Oct 14, 2024

Documentation

Welcome to TAPAS's documentation! — tapas 0.1 documentation
Tapas-privacy Developers, Jan 1, 2023

Tags