Anonymized data vs Synthetic data
If you anonymize your data before performing data testing or data analytics, there are several factors at play:
- In almost all cases, anonymized data can still be traced back to individuals due to specific and unique rows (e.g. medical records)
- The more you anonymize or generalize, the more data you destroy. This lowers the quality of your data and thus your insights
- Anonymization works differently for different data formats. This means it is not scalable and can be very time-consuming
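The first two points can be illustrated with a minimal sketch. The records, the column choice and the helper names below are hypothetical and are not taken from any real dataset; the sketch only assumes that age and postcode act as identifying columns. It shows that exact rows are unique (so they can be traced back), and that generalizing them into decades and postcode prefixes hides individuals only by throwing away detail.

```python
from collections import Counter

# Hypothetical toy records: (age, postcode, diagnosis). Illustration only.
records = [
    (34, "1011AB", "asthma"),
    (36, "1011AC", "diabetes"),
    (38, "1012XK", "asthma"),
    (52, "1012XL", "migraine"),
    (57, "1013ZZ", "diabetes"),
]

def generalize(record):
    """Coarsen age to a decade and postcode to its first three characters."""
    age, postcode, diagnosis = record
    low = age // 10 * 10
    return (f"{low}-{low + 9}", postcode[:3], diagnosis)

def smallest_group(rows):
    """Size of the smallest group of rows sharing the same (age, postcode).

    A value of 1 means at least one row is unique and can be singled out.
    """
    return min(Counter((r[0], r[1]) for r in rows).values())

def distinct_values(rows, column):
    """Number of different values left in a column (a rough measure of detail)."""
    return len({r[column] for r in rows})

generalized = [generalize(r) for r in records]

k_before = smallest_group(records)       # every original row is unique
k_after = smallest_group(generalized)    # rows now hide in small groups

print("smallest group before/after:", k_before, k_after)
print("distinct ages before/after:", distinct_values(records, 0), distinct_values(generalized, 0))
print("distinct postcodes before/after:", distinct_values(records, 1), distinct_values(generalized, 1))
```

In this toy example the smallest group grows from 1 to 2, but the five distinct ages collapse to two decades and the five postcodes collapse to a single prefix. Any analysis that depended on exact age or location has lost that signal.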
Synthetic data solves all of these shortcomings and more. Watch the video below to see an analytics expert from SAS (global market leader in analytics) explain his assessment of the difference in quality between original data, anonymized data and synthetic data generated by Syntho.
Edwin van Unen sent an original dataset to Syntho and we synthesized the dataset. But the question was also: “What will happen if we compare synthetic data to anonymized data?” Because you lose a lot of information in anonymized data, will this also happen when synthesizing a dataset? We started with a dataset from the telecommunications industry with 56,000 rows and 128 columns of company churn information. This dataset was both synthesized and anonymized so Edwin could compare synthesis with anonymization. Then, Edwin started modeling using SAS Viya. He built a couple of churn models on the original dataset with the standard SAS Viya options, using classical regression techniques and decision trees, but also more sophisticated techniques such as neural networks, gradient boosting and random forests.
Then, it was time to look at the results. The results were very promising for synthetic data and not for anonymization. For the non-machine-learning experts in the audience: we look at the area under the ROC curve, which tells us something about the accuracy of the model. Comparing the original data to the anonymized data, we see that the original data model has an area under the ROC curve of 0.8, which is pretty good. However, the anonymized data has an area under the ROC curve of 0.6. This means the anonymized data loses a lot of information, so the model loses a lot of predictive power.
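For readers who want to see what the area under the ROC curve measures, here is a minimal sketch in plain Python. It is not how SAS Viya computes the metric; it uses the equivalent pairwise definition: the share of (churner, non-churner) pairs in which the model gives the churner the higher score, with ties counting as half. The function name, labels and scores are hypothetical toy values, not results from the experiment.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for a positive case (e.g. churned), 0 for a negative case.
    scores: model scores, higher meaning "more likely positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))

# Hypothetical toy data: three churners followed by three non-churners.
churned = [1, 1, 1, 0, 0, 0]
informative_scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]    # mostly ranks churners higher
uninformative_scores = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]  # cannot tell anyone apart

print(roc_auc(churned, informative_scores))
print(roc_auc(churned, uninformative_scores))
```

A value of 1.0 means every churner is ranked above every non-churner, while 0.5 is no better than guessing. That is why a drop from 0.8 to 0.6 is a large loss: the model has moved most of the way toward random ranking.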
But then the question is: what about synthetic data? Here, we did exactly the same, but instead of anonymizing the data, Syntho synthesized the data. Now, we see both the original data and the synthetic data have an area under the ROC curve of 0.8, which is very similar. Not exactly the same due to variability, but very similar. This means the potential of synthetic data is very promising, and Edwin is very happy about this.