Synthetic data quality

The concept of a synthetic data twin explained

Syntho mimics (sensitive) data with AI to generate synthetic data twins

With a synthetic data twin, Syntho aims for synthetic data quality that is on par with the original data. We do this with our synthetic data software, which uses state-of-the-art machine learning models. Those ML models generate completely new data points and model them in such a way that the characteristics, relationships and statistical patterns of the original data are preserved to such an extent that you can use the result as if it were original data. This is what we call a synthetic data twin: synthetic data with preserved characteristics, relationships and patterns, as seen in the original data.

Synthetic data generation with AI by the Syntho Engine

As if it is original data?

How we evaluate generated synthetic data

Step 1

Quality report

Syntho offers a quality report for every generated synthetic dataset to demonstrate referential integrity, distributions, correlations, multivariate distributions and more.

Step 2

External assessment by SAS

SAS compares AI-generated synthetic data from Syntho with the original datasets through various (AI) assessments of data quality, legal validity and usability.

Our synthetic data quality report

Snapshots from our synthetic data quality report

Distributions

Synthetic data distributions
Distributions give insight into how frequently a given category or value occurs in the data, and they are captured by the Syntho Engine.
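
As an illustration only (this is not the Syntho Engine or the quality report implementation), one simple way to compare the distribution of a categorical column in original and synthetic data is the total variation distance between the two sets of relative frequencies. The function names and the sample values below are hypothetical.

```python
from collections import Counter


def relative_frequencies(values):
    """Return the share of records per category."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(original, synthetic):
    """Half the summed absolute difference between the two frequency tables.

    0.0 means identical distributions, 1.0 means no overlap at all.
    """
    freq_orig = relative_frequencies(original)
    freq_synth = relative_frequencies(synthetic)
    categories = set(freq_orig) | set(freq_synth)
    return 0.5 * sum(
        abs(freq_orig.get(c, 0.0) - freq_synth.get(c, 0.0)) for c in categories
    )


# Hypothetical example column (a contract type), not real data.
original = ["prepaid", "prepaid", "postpaid", "postpaid"]
synthetic = ["prepaid", "postpaid", "postpaid", "postpaid"]
print(total_variation_distance(original, synthetic))  # prints 0.25
```

A value close to 0 suggests that the synthetic column reproduces the category frequencies of the original column.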

Correlations

Synthetic data correlations
Correlations provide insight into the degree to which two variables are related and are captured by the Syntho Engine.
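
A minimal sketch of how such a check could look, assuming two numeric columns and the Pearson correlation coefficient as the measure (the quality report may use other measures). The column names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists.

    Assumes neither column is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_gap(original, synthetic):
    """Absolute difference between the correlation found in each dataset.

    Each argument is a pair of columns: (column_a, column_b).
    """
    return abs(pearson(*original) - pearson(*synthetic))


# Hypothetical columns: tenure in months and monthly charges.
original = ([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])
synthetic = ([1, 2, 3, 4, 5], [12, 18, 33, 39, 52])
print(correlation_gap(original, synthetic))
```

The smaller the gap, the better the synthetic data preserves the relationship between the two variables.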

Multivariates

Synthetic data multivariate distributions
Multivariate distributions and correlations provide insight into combinations of categories and are also captured by the Syntho Engine.
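
To illustrate what a check on combinations of categories could look like, the sketch below compares how often each combination of values occurs in the original and in the synthetic records. It is an assumption-based example, not the method used in the quality report; the column names and records are hypothetical.

```python
from collections import Counter


def joint_frequencies(rows, columns):
    """Share of records for every observed combination of the given columns."""
    combos = Counter(tuple(row[c] for c in columns) for row in rows)
    total = len(rows)
    return {combo: count / total for combo, count in combos.items()}


def largest_joint_gap(original_rows, synthetic_rows, columns):
    """Largest absolute frequency difference over all observed combinations."""
    freq_orig = joint_frequencies(original_rows, columns)
    freq_synth = joint_frequencies(synthetic_rows, columns)
    combos = set(freq_orig) | set(freq_synth)
    return max(
        abs(freq_orig.get(c, 0.0) - freq_synth.get(c, 0.0)) for c in combos
    )


# Hypothetical records, not real data.
original_rows = [
    {"plan": "prepaid", "churned": "yes"},
    {"plan": "prepaid", "churned": "no"},
    {"plan": "postpaid", "churned": "no"},
    {"plan": "postpaid", "churned": "no"},
]
synthetic_rows = [
    {"plan": "prepaid", "churned": "yes"},
    {"plan": "prepaid", "churned": "yes"},
    {"plan": "postpaid", "churned": "no"},
    {"plan": "postpaid", "churned": "no"},
]
print(largest_joint_gap(original_rows, synthetic_rows, ["plan", "churned"]))  # prints 0.25
```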

External references

Synthetic data assessment by SAS

For the case study, the target dataset was a telecom dataset provided by SAS containing data on 56,600 customers. The dataset contains 128 columns, including one column indicating whether a customer has left the company (i.e. ‘churned’) or not. The goal of the case study was to use the synthetic data to train models that predict customer churn and to assess the performance of each model. As churn prediction is a classification task, SAS selected five popular classification models to make the predictions:

  1. Decision tree
  2. Random forest
  3. Gradient boosting
  4. Logistic regression
  5. Neural network

Before generating the synthetic data, SAS randomly split the telecom dataset into a train set (for training the models) and a holdout set (for scoring the models). Having a separate holdout set for scoring allows for an unbiased assessment of how well the classification model might do when applied to new data.
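
A random train/holdout split of this kind can be sketched as follows. The holdout fraction and the seed are assumptions made for the example; the split ratio that SAS used is not stated here.

```python
import random


def split_train_holdout(records, holdout_fraction=0.25, seed=42):
    """Randomly split records into a train set and a holdout set.

    holdout_fraction and seed are illustrative defaults, not values
    taken from the case study.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    holdout = shuffled[:n_holdout]
    train = shuffled[n_holdout:]
    return train, holdout


# Hypothetical customer IDs standing in for the telecom records.
customer_ids = list(range(1000))
train, holdout = split_train_holdout(customer_ids)
print(len(train), len(holdout))  # prints 750 250
```

Because the holdout records are set aside before any synthetic data is generated, they never influence the generator or the trained models.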

Using the train set as input, Syntho used its Syntho Engine to generate a synthetic dataset. For benchmarking, SAS also created an anonymized version of the train set by applying various anonymization techniques until a certain k-anonymity threshold was reached. These steps resulted in four datasets:

  1. A train dataset (i.e. the original dataset minus the holdout dataset)
  2. A holdout dataset (i.e. a subset of the original dataset)
  3. An anonymized dataset (based on the original dataset minus the holdout dataset)
  4. A synthetic dataset (based on the original dataset minus the holdout dataset)
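
For readers unfamiliar with k-anonymity: a dataset is k-anonymous if every record shares its combination of quasi-identifier values with at least k - 1 other records. The sketch below computes k for a small table. The quasi-identifier columns and records are hypothetical, and the threshold that SAS targeted is not stated here.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group of records that share
    the same values on all quasi-identifier columns."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


# Hypothetical generalized records, not real data.
rows = [
    {"age_band": "30-39", "region": "north", "churned": "yes"},
    {"age_band": "30-39", "region": "north", "churned": "no"},
    {"age_band": "40-49", "region": "south", "churned": "no"},
    {"age_band": "40-49", "region": "south", "churned": "no"},
    {"age_band": "40-49", "region": "south", "churned": "yes"},
]
print(k_anonymity(rows, ["age_band", "region"]))  # prints 2
```

Reaching a higher k usually requires coarser generalization or suppression of values, which removes detail from the data.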

Datasets 1, 3 and 4 were used to train each classification model, resulting in 15 (3 x 5) trained models. SAS subsequently used the holdout dataset to measure the accuracy with which each model predicts customer churn. The results are presented below, starting with some basic statistics.
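
The experimental setup described above can be written down as a grid of training sets and model types, with every trained model scored on the same holdout set. The sketch below only enumerates that grid and defines accuracy; it does not train any models, and the names used are labels chosen for this example.

```python
from itertools import product

# The three training sets and five model types named in the case study.
TRAINING_SETS = ["train (original)", "anonymized", "synthetic"]
MODEL_TYPES = [
    "decision tree",
    "random forest",
    "gradient boosting",
    "logistic regression",
    "neural network",
]


def experiment_grid(training_sets, model_types):
    """Every (training set, model type) pair that has to be trained."""
    return list(product(training_sets, model_types))


def accuracy(true_labels, predicted_labels):
    """Share of holdout records for which the prediction is correct."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


grid = experiment_grid(TRAINING_SETS, MODEL_TYPES)
print(len(grid))  # prints 15

# Hypothetical holdout labels and predictions (1 = churned, 0 = stayed).
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # prints 0.75
```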

Evaluation results

Results of the assessment by SAS

Synthetic data not only preserves basic patterns (as shown in the plots above), it also captures the deep ‘hidden’ statistical patterns required for advanced analytics tasks. The latter is demonstrated in the bar chart, which indicates that the accuracy of models trained on synthetic data is on par with that of models trained on original data. Furthermore, with an area under the curve (AUC) close to 0.5, the models trained on anonymized data perform by far the worst. The full report with all advanced analytics assessments of the synthetic data in comparison with the original data is available on request.
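
To make the AUC figure concrete: the AUC can be read as the probability that a model gives a randomly chosen churner a higher score than a randomly chosen non-churner, so 1.0 is a perfect ranking and 0.5 is no better than random guessing. The sketch below computes it by comparing all pairs. It is a simple illustration with hypothetical labels and scores, not the evaluation code used by SAS.

```python
def roc_auc(labels, scores):
    """Probability that a randomly chosen positive record is scored above
    a randomly chosen negative record (ties count as half).

    Pairwise comparison is quadratic in the number of records, which is
    fine for an illustration but slow for large datasets.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical holdout labels (1 = churned) and model scores.
labels = [1, 1, 0, 0]
print(roc_auc(labels, [0.9, 0.8, 0.3, 0.1]))  # prints 1.0
print(roc_auc(labels, [0.5, 0.5, 0.5, 0.5]))  # prints 0.5
```

The second call shows why an AUC close to 0.5 is a poor result: a model that gives every customer the same score ranks churners no better than chance.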

Additionally, this synthetic data can be used to understand the data characteristics and the main variables needed for actual training of the models. The inputs selected by the algorithms on synthetic data were very similar to those selected on original data. Hence, the modeling process can be carried out on the synthetic version, which reduces the risk of data breaches. However, when running inference on individual records (e.g. a telco customer), retraining on original data is recommended for explainability, increased acceptance or simply because of regulation.

Synthetic data quality

“We do not have to worry about the data accuracy of synthetic data by Syntho”

Conclusions

Conclusions of the assessment by SAS
