Evaluating Utility and Resemblance in Synthetic Data Generators: A Technical Deep Dive and Comparative Analysis

Published:
February 27, 2024

Introduction

In today’s digital era, awareness of data privacy has heightened significantly. Users increasingly recognize their data as a unique digital fingerprint that puts their privacy at risk in the event of a data breach. This concern is amplified by regulations such as the GDPR, which empower users to request the deletion of their data. While much needed, such legislation can be very costly for companies, as access to data is minimized and the resulting restrictions are often time- and resource-consuming to overcome.


What are synthetic data generators?

Enter synthetic data, a solution to this conundrum. Synthetic data generators create datasets that mimic real user data while preserving anonymity and confidentiality. This approach is gaining traction across industries, from healthcare to finance, where privacy is paramount.  

This post is tailored for data professionals and enthusiasts, focusing on the evaluation of synthetic data generators. We delve into key metrics and conduct a comparative analysis between Syntho’s Engine and its open-source alternatives, offering insights into how to effectively assess the quality of synthetic data generation. We also evaluate the time cost of each model, to give further insight into how the models behave in practice.

How to choose the right synthetic data generation method?

In the diverse landscape of synthetic data generation, there’s an abundance of methods available, each vying for attention with its unique capabilities. Choosing the most suitable method for a particular application requires a thorough understanding of the performance characteristics of each option. This necessitates a comprehensive evaluation of various synthetic data generators based on a set of well-defined metrics to make an informed decision. 

What follows is a rigorous comparative analysis of the Syntho Engine alongside a well-known open-source framework, the Synthetic Data Vault (SDV). In this analysis, we use commonly applied metrics covering statistical fidelity, predictive accuracy, and inter-variable relationships.

Synthetic Data Evaluation Metrics

Before introducing any specific metric, we must acknowledge that there are numerous philosophies on evaluating synthetic data, each giving insight into a certain aspect of the data. With this in mind, the following three categories stand out as important and comprehensive, each providing insight into a different aspect of data quality:

  1. Statistical Fidelity Metrics: Examining basic statistical features of the data, such as means and variances, to ensure the synthetic data aligns with the original dataset’s statistical profile. 
  2. Predictive Accuracy: Examining the performance of a machine learning model trained on original data and evaluated on synthetic data (Train Real – Test Synthetic, TRTS), and vice versa (Train Synthetic – Test Real, TSTR). 
  3. Inter-Variable Relationships: This combined category includes: 
    • Feature Correlation: We assess how well the synthetic data maintains the relationships between variables using correlation coefficients. A well-known metric such as the Propensity Mean Squared Error (PMSE) falls into this category. 
    • Mutual Information: We measure the mutual dependence between variables to understand the depth of these relationships beyond mere correlations. A minimal sketch of such checks is shown below the list. 
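
To make these categories concrete, here is a minimal sketch of how such checks could be computed by hand with pandas and scikit-learn. The DataFrames `real` and `synthetic` and the column names passed to the helpers are illustrative assumptions, not part of the study’s actual tooling.

```python
# A minimal sketch, assuming `real` and `synthetic` are pandas DataFrames
# with identical columns; helper and column names are illustrative only.
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

def fidelity_summary(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Compare basic statistics of numeric columns (statistical fidelity)."""
    num_cols = real.select_dtypes(include=np.number).columns
    return pd.DataFrame({
        "real_mean": real[num_cols].mean(),
        "synth_mean": synthetic[num_cols].mean(),
        "real_std": real[num_cols].std(),
        "synth_std": synthetic[num_cols].std(),
    })

def correlation_gap(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Mean absolute difference between the two correlation matrices."""
    num_cols = real.select_dtypes(include=np.number).columns
    diff = real[num_cols].corr() - synthetic[num_cols].corr()
    return float(diff.abs().mean().mean())

def mi_gap(real: pd.DataFrame, synthetic: pd.DataFrame, col_a: str, col_b: str) -> float:
    """Difference in mutual information between a pair of categorical columns."""
    return abs(mutual_info_score(real[col_a], real[col_b])
               - mutual_info_score(synthetic[col_a], synthetic[col_b]))
```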

Comparative Analysis: Syntho Engine vs. Open-Source Alternatives

The comparative analysis was conducted using a standardized evaluative framework and identical testing techniques across all models, including Syntho Engine and SDV models. By synthesizing datasets from identical sources and subjecting them to the same statistical tests and machine learning model assessments, we ensure a fair and unbiased comparison. The section that follows details the performance of each synthetic data generator across the range of metrics presented above.  

As for the dataset used for the evaluation, we used the UCI Adult Census dataset, which is well known in the machine learning community. We cleaned the data prior to all training and then split it into two sets (a training set and a holdout set for testing). We used the training set to generate one million new datapoints with each of the models and evaluated various metrics on these generated datasets. For the further machine learning evaluations, we used the holdout set to evaluate metrics such as those related to TSTR and TRTS.
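
As a rough sketch of the pipeline just described, the open-source side can be reproduced with the SDV 1.x API roughly as follows; the file name, split ratio, and random seed are illustrative assumptions, and the Syntho Engine side (driven through its own platform) is not shown.

```python
# A minimal sketch of the generation pipeline, using the open-source SDV
# library (v1.x API); variable names and the file name are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

data = pd.read_csv("adult.csv")  # cleaned UCI Adult Census data (assumed file name)
train, holdout = train_test_split(data, test_size=0.2, random_state=42)

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(train)

synthesizer = GaussianCopulaSynthesizer(metadata)  # default parameters, as in this study
synthesizer.fit(train)
synthetic = synthesizer.sample(num_rows=1_000_000)
```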

Each generator was run with default parameters. As some of the models, like Syntho’s, work out of the box on any tabular data, no fine-tuning was done. Searching for the right hyperparameters for each model would take a significant amount of time, and Table 5 already shows a large time difference between Syntho’s model and those it was tested against.

It is noteworthy that, unlike the rest of the SDV models, the Gaussian Copula Synthesizer is based on statistical methods, whereas the others are based on neural networks such as Generative Adversarial Network (GAN) models and variational autoencoders. This is why the Gaussian Copula can be seen as a baseline for all the models discussed.

Results

Data Quality

Figure 1. Visualization of basic quality results for all models

The previously discussed adherence to trends and representations in the data can be seen in Figure 1 and Table 1. Each of the metrics in use can be interpreted as follows (a sketch of computing them with SDV follows the list):

  • Overall Quality Score: Overall assessment of synthetic data’s quality, combining various aspects like statistical similarity and data characteristics. 
  • Column Shapes: Assesses whether the synthetic data maintains the same distribution shape as the real data for each column. 
  • Column Pair Trends: Evaluates relationships or correlations between pairs of columns in the synthetic data compared to the real data. 
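
As referenced above, these scores come from SDV’s metrics tooling. A minimal sketch of producing them, assuming the `real`, `synthetic`, and `metadata` objects from the generation sketch earlier:

```python
# A minimal sketch; evaluate_quality is part of SDV's evaluation module (v1.x).
from sdv.evaluation.single_table import evaluate_quality

report = evaluate_quality(real_data=real, synthetic_data=synthetic, metadata=metadata)
print(report.get_score())                                  # Overall Quality Score
print(report.get_details(property_name="Column Shapes"))   # per-column distribution match
print(report.get_details(property_name="Column Pair Trends"))  # pairwise relationships
```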

Overall, Syntho achieves very high scores across the board. To begin with, when looking at overall data quality (evaluated with the SDV metrics library), Syntho achieves a result upwards of 99% (with a column shape adherence of 99.92% and a column pair shape adherence of 99.31%). By contrast, SDV obtains a result of at most 90.84% (with Gaussian Copula, having a column shape adherence of 93.82% and a column pair shape adherence of 87.86%).


Table 1. A tabular representation of the quality scores of each generated dataset per model 

Data Coverage

The Diagnosis Report module of SDV brings to our attention that SDV-generated data is (in all cases) missing more than 10% of the numeric ranges; in the case of the Tabular Variational Autoencoder (TVAE), the same amount of categorical data is also missing when compared to the original dataset. No such warnings were generated for the results achieved with Syntho.
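
For reference, a minimal sketch of producing such a diagnosis with SDV, reusing the objects from the earlier generation sketch; note that the exact report API varies between SDV versions.

```python
# A minimal sketch; run_diagnostic is part of SDV's evaluation module (v1.x).
from sdv.evaluation.single_table import run_diagnostic

diagnostic = run_diagnostic(real_data=real, synthetic_data=synthetic, metadata=metadata)
# Depending on the SDV version, results are inspected via get_results() or
# get_details(); the report flags issues such as missing numeric ranges.
print(diagnostic.get_results())
```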

Figure 2. Visualization of average column-wise performance metrics for all models 

In the comparative analysis, the plot in Figure 2 illustrates that SDV achieves marginally better category coverage with some of its models (namely GaussianCopula, CopulaGAN, and the Conditional Tabular GAN, CTGAN). Nevertheless, it is important to highlight that the reliability of Syntho’s data surpasses that of the SDV models: the discrepancy in coverage across categories and ranges is minimal, exhibiting a mere 1.1% variance, whereas the SDV models demonstrate considerable variation, ranging from 14.6% to 29.2%.

The metrics represented here can be interpreted as follows (a per-column sketch follows the list):

  • Category Coverage: Measures the presence of all categories in synthetic data as compared to real data.
  • Range Coverage: Evaluates how well the range of values in synthetic data matches that in real data. 
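
As referenced above, here is a per-column sketch of these two metrics using the SDMetrics library, which underpins SDV’s reports; the column names `workclass` and `age` come from the Adult dataset and serve only as examples.

```python
# A minimal sketch of the per-column coverage metrics from SDMetrics;
# `real` and `synthetic` are the DataFrames from the generation sketch.
from sdmetrics.single_column import CategoryCoverage, RangeCoverage

cat_score = CategoryCoverage.compute(
    real_data=real["workclass"], synthetic_data=synthetic["workclass"]
)   # fraction of the real categories that appear in the synthetic column
num_score = RangeCoverage.compute(
    real_data=real["age"], synthetic_data=synthetic["age"]
)   # how much of the real min-max range the synthetic values span
```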

Table 2. A tabular representation of the average coverage of a given attribute type per model 

Utility

Moving on to the utility of synthetic data, the matter of training models on the data becomes relevant. To have a balanced and fair comparison between all frameworks, we chose the default Gradient Boosting Classifier from the scikit-learn library, as it is widely accepted as a well-performing model with out-of-the-box settings.

Two different models are trained: one on the synthetic data (for TSTR) and one on the original data (for TRTS). The model trained on the synthetic data is evaluated using a holdout test set (which was not used during synthetic data generation), while the model trained on original data is tested on the synthetic dataset.
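
A minimal sketch of this TSTR/TRTS procedure, assuming feature matrices and labels (`X_synth`/`y_synth` from the synthetic data, `X_real`/`y_real` from the training data, `X_holdout`/`y_holdout` from the holdout set) have already been encoded; these variable names are illustrative.

```python
# A minimal sketch using scikit-learn's default GradientBoostingClassifier,
# as described above; the feature/label variables are assumed to be prepared.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def auc(train_X, train_y, test_X, test_y) -> float:
    model = GradientBoostingClassifier()            # out-of-the-box settings
    model.fit(train_X, train_y)
    return roc_auc_score(test_y, model.predict_proba(test_X)[:, 1])

tstr = auc(X_synth, y_synth, X_holdout, y_holdout)  # Train Synthetic, Test Real
trts = auc(X_real, y_real, X_synth, y_synth)        # Train Real, Test Synthetic
```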


Figure 3. Visualization of Area Under the Curve (AUC) scores per method per model 

The results visualized above demonstrate the superiority of the synthetic data generated by the Syntho Engine compared to the other methods: there is virtually no difference between the scores Syntho obtains under the different evaluation methods, pointing towards a high similarity between the synthetic and real data. The red dotted line in the plot is the baseline obtained by evaluating a Train Real, Test Real (TRTR) setup. This line represents the value 0.92, the Area Under the Curve (AUC) score achieved by a model trained on real data and tested on real data.


Table 3. A tabular representation of the AUC scores achieved by TRTS and TSTR respectively per model. 

Time-wise comparison

Naturally, it is crucial to consider the time invested in generating these results. The visualization below illustrates just this.


Figure 5. Visualization of the time taken to train each model and generate one million synthetic datapoints, with and without a GPU. 

Figure 5 illustrates the time taken to generate synthetic data in two different settings. The first (referred to here as “without GPU”) consists of test runs on a system with an Intel Xeon CPU with 16 cores running at 2.20 GHz. The tests marked as “with GPU” were run on a system with an AMD Ryzen 9 7945HX CPU with 16 cores running at 2.5 GHz and an NVIDIA GeForce RTX 4070 Laptop GPU. As can be seen in Figure 5 and in Table 5 below, Syntho is significantly faster at generating synthetic data (in both scenarios), which is critical in a dynamic workflow.
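
A minimal sketch of how such wall-clock timings can be collected, reusing the synthesizer interface from the generation sketch; the measurement approach shown is an illustrative assumption, not the exact benchmarking harness used here.

```python
# A minimal sketch: time training and sampling separately with perf_counter.
import time

t0 = time.perf_counter()
synthesizer.fit(train)                     # training phase
fit_seconds = time.perf_counter() - t0

t0 = time.perf_counter()
synthesizer.sample(num_rows=1_000_000)     # generation phase
sample_seconds = time.perf_counter() - t0

print(f"fit: {fit_seconds:.1f}s, sample: {sample_seconds:.1f}s")
```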


Table 5. A tabular representation of the time taken to train each model and generate one million synthetic datapoints, with and without a GPU 

Concluding Remarks and Future Directions 

The findings underscore the importance of thorough quality evaluation in choosing the right synthetic data generation method. Syntho’s Engine, with its AI-driven approach, demonstrates noteworthy strengths in certain metrics, while open-source tools like SDV shine in their versatility and community-driven improvements. 

As the field of synthetic data continues to evolve, we encourage you to apply these metrics in your projects, explore their intricacies, and to share your experiences. Stay tuned for future posts where we will dive deeper into other metrics and highlight real-world examples of their application. 

At the end of the day, for those looking to test the waters with synthetic data, the presented open-source alternative can be a justifiable choice given its accessibility; however, for professionals incorporating this modern technology into their development process, any chance at improvement must be taken and all hindrances avoided. It is therefore important to choose the best option available. With the analyses provided above, it becomes rather apparent that Syntho, and with it the Syntho Engine, is a very capable tool for practitioners.

About Syntho

Syntho provides a smart synthetic data generation platform, leveraging multiple synthetic data forms and generation methods, empowering organizations to intelligently transform data into a competitive edge. Our AI-generated synthetic data mimics statistical patterns of original data, ensuring accuracy, privacy, and speed, as assessed by external experts like SAS. With smart de-identification features and consistent mapping, sensitive information is protected while preserving referential integrity. Our platform enables the creation, management, and control of test data for non-production environments, utilizing rule-based synthetic data generation methods for targeted scenarios. Additionally, users can generate synthetic data programmatically and obtain realistic test data to develop comprehensive testing and development scenarios with ease.  

Do you want to learn more about practical applications of synthetic data? Feel free to schedule a demo!

About the authors

Machine Learning Engineer

Mihai received his PhD from the University of Bristol on the topic of Hierarchical Reinforcement Learning applied to robotics and is a Machine Learning Engineer at Syntho. 

Software Engineering Intern

Roham is a bachelor’s student at the Delft University of Technology and a Software Engineering Intern at Syntho. 
