Synthetic data quality
Explore our data quality assurance report with the results assessed by the SAS data experts
The Benefits of Synthetic Data Twins
Syntho mimics (sensitive) data with AI to generate synthetic data twins
Syntho uses advanced artificial intelligence (AI) to create synthetic data that accurately mimics sensitive data. Our goal is to generate synthetic data that is as accurate as possible in comparison to the original data. Using our Syntho Engine software and machine learning models, we generate entirely new data points while maintaining the statistical patterns and relationships found in the original data. The result is synthetic data that preserves the key characteristics of the original data and is practically indistinguishable from real data, so that it can even be used for analytics. Hence, we call this AI-generated synthetic data a synthetic data twin, because it is “as-good-as-real” in comparison to the real data. By leveraging synthetic data twins, businesses can unlock numerous benefits through a range of value-adding synthetic data use cases.
Key question: how accurate is the synthetic data in comparison to the real data?
Syntho’s Quality Assurance (QA) Report
At Syntho, we understand the importance of reliable and accurate data for your business. That’s why every synthetic data run comes with a comprehensive quality assurance report that shows how accurate the synthetic data is compared to the original data. The report includes metrics such as distributions, correlations, multivariate distributions, and privacy metrics. This way, you can verify for yourself that the synthetic data is of high quality and can be used with a level of accuracy and reliability comparable to your original data.
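To make two of these metric families concrete, here is a minimal sketch of how a distribution comparison and a correlation comparison could be computed. It is not the Syntho Engine's implementation: the function names, the choice of total variation distance and Pearson correlation, and the toy columns are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def total_variation_distance(real, synthetic):
    """Distance between two categorical columns: 0 = identical frequencies, 1 = disjoint."""
    real_freq = Counter(real)
    synth_freq = Counter(synthetic)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy columns, for illustration only (not taken from any report).
real_plan = ["basic", "basic", "premium", "basic", "premium", "basic"]
synth_plan = ["basic", "premium", "premium", "basic", "premium", "basic"]

real_usage = [10.0, 12.0, 30.0, 11.0, 28.0, 9.0]
real_bill = [20.0, 22.0, 55.0, 21.0, 50.0, 19.0]
synth_usage = [11.0, 29.0, 31.0, 10.0, 12.0, 9.5]
synth_bill = [21.0, 52.0, 56.0, 20.0, 23.0, 19.5]

# Univariate check: how far apart are the category frequencies?
tvd = total_variation_distance(real_plan, synth_plan)

# Bivariate check: is the usage/bill relationship equally strong in both datasets?
corr_gap = abs(pearson(real_usage, real_bill) - pearson(synth_usage, synth_bill))

print(f"plan TVD: {tvd:.3f}, usage/bill correlation gap: {corr_gap:.3f}")
```

Values near zero for both numbers would indicate that the synthetic column and the synthetic relationship stay close to the original ones. A real report would repeat such checks over every column and column pair.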
External assessment by the data experts of SAS
Though Syntho is proud to offer its users an advanced quality assurance report, which is generated automatically by our Syntho Engine, we also understand the importance of having an external, objective evaluation of our synthetic data. That’s why we enlisted the help of SAS, a leading data expert, to assess our synthetic data.
SAS conducted thorough evaluations of the data accuracy, privacy protection, and usability of Syntho’s AI-generated synthetic data in comparison to the original data. In conclusion, SAS assessed and approved Syntho’s synthetic data as being accurate, secure, and usable in comparison to the original data.
Snapshots from our synthetic data quality report
Synthetic data assessment by SAS
For the case study, the target dataset was a telecom dataset. The dataset contains 128 columns, including one column indicating whether a customer has left the company (i.e. ‘churned’) or not. The goal of the case study was to use the synthetic data to train several models to predict customer churn and to assess the performance of each model. As churn prediction is a classification task, SAS selected four popular classification models to make the predictions:
- Random forest
- Gradient boosting
- Logistic regression
- Neural network
Before generating the synthetic data, SAS randomly split the telecom dataset into a train set (for training the models) and a holdout set (for scoring the models). Having a separate holdout set for scoring allows for an unbiased assessment of how well the classification model might do when applied to new data.
Using the train set as input, Syntho used its Syntho Engine to generate a synthetic dataset. For benchmarking, SAS also created an anonymized version of the train set by applying various anonymization techniques until a certain k-anonymity threshold was reached. These steps resulted in four datasets:
- A train dataset (i.e. the original dataset minus the holdout dataset)
- A holdout dataset (i.e. a subset of the original dataset)
- An anonymized dataset (based on the original dataset minus the holdout dataset)
- A synthetic dataset (based on the original dataset minus the holdout dataset)
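For readers unfamiliar with the k-anonymity threshold mentioned for the anonymized dataset: a dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k rows. The sketch below shows one way to measure k. The column names, the generalized values, and the helper `k_anonymity` are hypothetical; the source does not state which columns or which threshold SAS used.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return min(groups.values())


# Hypothetical, already-generalized records (age bands, truncated postcodes).
rows = [
    {"age_band": "30-39", "postcode": "10**", "churned": 0},
    {"age_band": "30-39", "postcode": "10**", "churned": 1},
    {"age_band": "40-49", "postcode": "10**", "churned": 0},
    {"age_band": "40-49", "postcode": "10**", "churned": 0},
    {"age_band": "40-49", "postcode": "10**", "churned": 1},
]

k = k_anonymity(rows, ["age_band", "postcode"])
print(f"dataset is {k}-anonymous on the chosen quasi-identifiers")
```

Reaching a higher k usually means generalizing or suppressing more values, which is why anonymization can remove the very signal a model needs.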
The train, anonymized, and synthetic datasets (datasets 1, 3, and 4 above) were each used to train the four classification models, resulting in 12 (3 x 4) trained models. SAS then used the holdout dataset to measure how accurately each model predicts customer churn. The results are presented below, starting with some basic statistics.
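The evaluation design described above (train on each dataset, score every model on the same real holdout set) can be sketched as follows. This is an illustration under stated assumptions, not the SAS workflow: `fit_centroid_scorer` is a toy stand-in for the actual classifiers, the data are made up, and the anonymized toy set is deliberately constructed so that its features carry no signal.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def fit_centroid_scorer(features, labels):
    """Toy stand-in for a classifier: higher score = closer to the churn centroid."""
    def centroid(target):
        rows = [x for x, y in zip(features, labels) if y == target]
        return [sum(col) / len(rows) for col in zip(*rows)]

    churn_c, stay_c = centroid(1), centroid(0)

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return lambda x: sq_dist(x, stay_c) - sq_dist(x, churn_c)


def evaluate(train_sets, model_fitters, holdout_x, holdout_y):
    """Train every model on every training set; score each on the same real holdout set."""
    results = {}
    for data_name, (train_x, train_y) in train_sets.items():
        for model_name, fit in model_fitters.items():
            scorer = fit(train_x, train_y)
            scores = [scorer(x) for x in holdout_x]
            results[(data_name, model_name)] = auc(holdout_y, scores)
    return results


# Hypothetical toy data: two numeric features, label 1 = churned.
train_sets = {
    "original": ([[1.0, 5.0], [2.0, 6.0], [8.0, 1.0], [9.0, 2.0]], [0, 0, 1, 1]),
    # Over-generalized on purpose: every row looks the same, so no signal is left.
    "anonymized": ([[5.0, 3.5], [5.0, 3.5], [5.0, 3.5], [5.0, 3.5]], [0, 0, 1, 1]),
    "synthetic": ([[1.5, 5.5], [2.5, 5.0], [7.5, 1.5], [8.5, 2.5]], [0, 0, 1, 1]),
}
model_fitters = {"centroid": fit_centroid_scorer}

# The holdout set is real data that no model and no generator has seen.
holdout_x = [[1.2, 5.2], [8.8, 1.8], [2.2, 6.1], [7.9, 1.1]]
holdout_y = [0, 1, 0, 1]

results = evaluate(train_sets, model_fitters, holdout_x, holdout_y)
for (data_name, model_name), value in results.items():
    print(f"{model_name} trained on {data_name}: holdout AUC = {value:.2f}")
```

Scoring all models on one untouched holdout set is the key design choice: it keeps the comparison between training sources fair, because the only thing that varies is the data the model learned from.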
Our synthetic data is approved by the data experts of SAS
Results of the assessment by SAS
Synthetic data holds up not only for basic patterns (as shown in the earlier plots of the Syntho QA report); it also captures the deep, ‘hidden’ statistical patterns required for advanced analytics tasks. This is demonstrated in the bar chart, which shows that the accuracy of models trained on synthetic data is on par with that of models trained on original data. Furthermore, with an area under the curve (AUC) close to 0.5, the level expected from random guessing, the models trained on anonymized data perform by far the worst. The full report with all advanced analytics assessments on synthetic data in comparison with the original data is available on request.
Additionally, this synthetic data can be used to understand the data characteristics and the main variables needed for the actual training of the models. The inputs selected by the algorithms on synthetic data were very similar to those selected on original data. Hence, the modeling process can be carried out on the synthetic version, which reduces the risk of data breaches.
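One simple way to quantify how similar the selected inputs are is to compare the top-ranked variables from each model. The sketch below is an assumption about how such a check could look, not the method SAS used; the variable names and importance scores are invented for illustration.

```python
def top_k(importances, k):
    """Names of the k most important variables, highest importance first."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def jaccard_overlap(a, b):
    """Share of variables two selections have in common (1.0 = identical sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Hypothetical importance scores from a model trained on each dataset.
original_importance = {
    "tenure": 0.40,
    "monthly_charges": 0.25,
    "support_calls": 0.20,
    "contract_type": 0.10,
    "region": 0.05,
}
synthetic_importance = {
    "tenure": 0.38,
    "monthly_charges": 0.22,
    "support_calls": 0.24,
    "contract_type": 0.09,
    "region": 0.07,
}

top_original = top_k(original_importance, 3)
top_synthetic = top_k(synthetic_importance, 3)
overlap = jaccard_overlap(top_original, top_synthetic)

print("top inputs (original): ", top_original)
print("top inputs (synthetic):", top_synthetic)
print(f"overlap of selected inputs: {overlap:.2f}")
```

An overlap close to 1.0 means both models rely on the same variables, even if their exact order differs slightly.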
Additional results of an assessment by SAS for a leading hospital
Does synthetic data match the accuracy of real data?
The correlations and relationships between variables were accurately preserved in the synthetic data.
The Area Under the Curve (AUC), a metric for measuring model performance, remained consistent.
Furthermore, the variable importance, which indicates the predictive power of variables in a model, remained intact when comparing synthetic data to the original dataset.
Based on these observations, we can confidently conclude that synthetic data generated by the Syntho Engine in SAS Viya is indeed on par with real data in terms of quality. This validates the use of synthetic data for model development, paving the way for cancer research focused on predicting deterioration and mortality.
Conclusions of the assessment by SAS