AI-generated synthetic data: easy and fast access to high-quality data?

April 14, 2022

Marijn Vonk, Chief Product Officer & Co-founder

AI-generated synthetic data in practice

Syntho, an expert in AI-generated synthetic data, aims to turn privacy by design into a competitive advantage. The company helps organizations build a strong data foundation with easy and fast access to high-quality data, and recently won the Philips Innovation Award.

However, synthetic data generation with AI is a relatively new solution, and it typically raises a recurring set of questions. To answer these, Syntho started a case study together with SAS, the market leader in Advanced Analytics and AI software.

In collaboration with the Dutch AI Coalition (NL AIC), they investigated the value of synthetic data by comparing synthetic data generated by the Syntho Engine with the original data, using assessments of data quality, legal validity, and usability.

Is data anonymization not a solution?

Classic anonymization techniques have one thing in common: they manipulate the original data to make it harder to trace records back to individuals. Examples are generalization, suppression, wiping, pseudonymization, data masking, and shuffling of rows and columns. You can find examples in the table below.

Those techniques introduce 3 key challenges:

They work differently per data type and per dataset, making them hard to scale. Furthermore, because they work differently, there will always be debate about which methods to apply and which combination of techniques is needed.
There is always a one-to-one relationship with the original data. This means that there will always be a privacy risk, especially given the many open datasets and the techniques available to link those datasets.
They manipulate data and thereby destroy information in the process. This is especially devastating for AI tasks where “predictive power” is essential, because bad-quality data results in bad insights from the AI model (garbage in, garbage out).

These points are also assessed via this case study.
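As a rough illustration of two of the techniques named above (generalization and suppression), here is a minimal Python sketch. The records, the field names, and the helper functions `generalize_age` and `suppress_zip` are all invented for this example and are not taken from the case study. It shows how detail is lost while each anonymized row still corresponds to exactly one original row.

```python
# Toy records; every value here is invented for illustration only.
records = [
    {"age": 23, "zip": "1011AB", "plan": "basic"},
    {"age": 27, "zip": "1011CD", "plan": "premium"},
    {"age": 41, "zip": "3511EF", "plan": "basic"},
    {"age": 45, "zip": "3511GH", "plan": "premium"},
]


def generalize_age(age, width=10):
    """Generalization: replace an exact age with a bucket such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def suppress_zip(zip_code, keep=2):
    """Suppression: keep the first `keep` characters and mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def anonymize(record):
    """Apply both techniques to one record; 'plan' is left untouched."""
    return {
        "age": generalize_age(record["age"]),
        "zip": suppress_zip(record["zip"]),
        "plan": record["plan"],
    }


# Row i of `anonymized` is still derived from row i of `records`
# (the one-to-one relationship mentioned above).
anonymized = [anonymize(r) for r in records]

# Count distinct (age, zip) combinations before and after.
distinct_before = len({(r["age"], r["zip"]) for r in records})
distinct_after = len({(r["age"], r["zip"]) for r in anonymized})
print(distinct_before, distinct_after)
```

The drop in distinct combinations is the information loss: any analysis that depends on exact ages or full postal codes can no longer be reproduced from the anonymized rows.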

An introduction to the case study

For the case study, the target dataset was a telecom dataset provided by SAS containing the data of 56,600 customers. The dataset contains 128 columns, including one column indicating whether a customer has left the company (i.e. ‘churned’) or not. The goal of the case study was to use the synthetic data to train models that predict customer churn and to evaluate the performance of those trained models. As churn prediction is a classification task, SAS selected four popular classification models to make the predictions:

Random forest
Gradient boosting
Logistic regression
Neural network

Before generating the synthetic data, SAS randomly split the telecom dataset into a train set (for training the models) and a holdout set (for scoring the models). Having a separate holdout set for scoring allows for an unbiased assessment of how well the classification model might perform when applied to new data.
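A random train/holdout split can be sketched in a few lines of Python. This is only an illustration: the article does not state the split ratio SAS used, so the `holdout_fraction` of 0.3, the seed, and the function name `split_train_holdout` are assumptions.

```python
import random


def split_train_holdout(rows, holdout_fraction=0.3, seed=42):
    """Randomly split rows into (train, holdout).

    holdout_fraction and seed are assumed values, not taken from the case study.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)          # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Stand-in for customer records: plain integer IDs.
customers = list(range(1000))
train, holdout = split_train_holdout(customers)
print(len(train), len(holdout))
```

Because the holdout rows never reach the models (or, in this case study, the synthetic data generator), scores on them approximate performance on unseen customers.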

Using the train set as input, Syntho used its Syntho Engine to generate a synthetic dataset. For benchmarking, SAS also created a manipulated version of the train set by applying various anonymization techniques until a certain threshold (of k-anonymity) was reached. These steps resulted in four datasets:

1. A training dataset (i.e. the original dataset minus the holdout dataset)
2. A holdout dataset (i.e. a subset of the original dataset)
3. An anonymized dataset (based on the training dataset)
4. A synthetic dataset (based on the training dataset)
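The k-anonymity threshold mentioned for the anonymized dataset can be checked with a short sketch: a table is k-anonymous when every combination of quasi-identifier values is shared by at least k rows. The rows, column names, and the function `k_anonymity` below are hypothetical; the article does not say which columns or which value of k SAS used.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    quasi-identifier values. The table is k-anonymous for any k up to this value.
    Raises ValueError on an empty table."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


# Invented, already-generalized toy rows.
rows = [
    {"age_band": "20-29", "region": "North", "churned": 0},
    {"age_band": "20-29", "region": "North", "churned": 1},
    {"age_band": "40-49", "region": "South", "churned": 0},
    {"age_band": "40-49", "region": "South", "churned": 0},
    {"age_band": "40-49", "region": "South", "churned": 1},
]

k = k_anonymity(rows, ["age_band", "region"])
print(k)
```

Raising k usually requires coarser generalization or more suppression, which is where the loss of data quality discussed in this article comes from.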

Datasets 1, 3, and 4 were used to train each classification model, resulting in 12 (3 x 4) trained models. SAS subsequently used the holdout dataset to measure the accuracy with which each model predicts customer churn. The results are presented below, starting with some basic statistics.
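The experiment design described here can be written down as a small grid. The sketch below only enumerates the twelve combinations of training dataset and model, all scored on the same holdout set. It does not train anything, and the labels are shorthand chosen for this example.

```python
from itertools import product

# Three training datasets and four model types, as described in the text.
TRAINING_SETS = ["original", "anonymized", "synthetic"]
MODELS = [
    "random forest",
    "gradient boosting",
    "logistic regression",
    "neural network",
]

# Every model is trained on every training dataset and always scored on the
# same holdout set, which consists of original (real) records.
experiments = [
    {"train_on": dataset, "model": model, "score_on": "holdout"}
    for dataset, model in product(TRAINING_SETS, MODELS)
]

for experiment in experiments:
    print(experiment)
```

Scoring every model on the same real holdout data is what makes the three training datasets comparable.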

Figure: Machine Learning pipeline generated in SAS Visual Data Mining and Machine Learning

Basic statistics when comparing anonymized data to original data

Anonymization techniques destroy even basic patterns, business logic, relationships, and statistics (as in the example below). Using anonymized data for basic analytics thus produces unreliable results. In fact, the poor quality of the anonymized data made it almost impossible to use it for advanced analytics tasks (e.g. AI/ML modeling and dashboarding).

Figure: Performance of original data vs anonymized data

Basic statistics when comparing synthetic data with original data

Synthetic data generation with AI preserves basic patterns, business logic, relationships, and statistics (as in the example below). Using synthetic data for basic analytics thus produces reliable results. The key question is, does synthetic data hold for advanced analytics tasks (e.g. AI/ML modeling and dashboarding)?

Figure: Performance of original data vs synthetic data
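One simple way to check whether basic statistics and relationships are preserved is to compute the same summary on both tables and compare. The sketch below uses invented toy values and hypothetical column names (`tenure`, `bill`). It is not the assessment used in the case study, only an illustration of the idea.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def summarize(table, x, y):
    """Mean and standard deviation of column x, plus its correlation with y."""
    return {
        "mean": mean(table[x]),
        "std": pstdev(table[x]),
        "corr": pearson(table[x], table[y]),
    }


# Invented toy columns; not taken from the telecom dataset.
original = {"tenure": [1, 2, 3, 4, 5], "bill": [10, 20, 30, 40, 50]}
synthetic = {"tenure": [1, 2, 3, 4, 6], "bill": [12, 19, 31, 41, 58]}

orig_stats = summarize(original, "tenure", "bill")
synth_stats = summarize(synthetic, "tenure", "bill")
for name in orig_stats:
    print(name, round(orig_stats[name], 3), round(synth_stats[name], 3))
```

If the summaries stay close across many columns and column pairs, basic analytics on the synthetic table should give similar answers to the original.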

AI-generated synthetic data and advanced analytics

Synthetic data preserves not only basic patterns (as shown in the plots above), but also the deep ‘hidden’ statistical patterns required for advanced analytics tasks. This is demonstrated in the bar chart below, which shows that models trained on synthetic data achieve accuracy similar to that of models trained on original data. Furthermore, with an area under the curve (AUC*) close to 0.5, the models trained on anonymized data perform by far the worst. The full report with all advanced analytics assessments on synthetic data in comparison with the original data is available on request.

*AUC: the area under the curve is a measure of the accuracy of advanced analytics models, taking into account true positives, false positives, false negatives, and true negatives. A value of 0.5 means that a model predicts randomly and has no predictive power, and a value of 1 means that the model is always correct and has full predictive power.
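To make the AUC scale concrete, here is a minimal from-scratch sketch. It uses the pairwise interpretation of AUC: the fraction of (churner, non-churner) pairs in which the churner receives the higher score, with ties counted as half. The function name `auc` and the toy scores are illustrative. The case study itself computed AUC in SAS.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class (e.g. churned), 0 otherwise.
    scores: model scores, where higher means "more likely positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


perfect = auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])         # perfect ranking
uninformative = auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])   # constant score
partial = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])        # one pair misordered
print(perfect, uninformative, partial)
```

The pairwise loop is quadratic in the number of rows, which is fine for illustration. Library implementations use a sort-based method instead.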

Additionally, this synthetic data can be used to understand data characteristics and to identify the main variables needed for the actual training of the models. The inputs selected by the algorithms on synthetic data were very similar to those selected on the original data. Hence, the modeling process can be carried out on the synthetic version, which reduces the risk of data breaches. However, when making predictions about individual records (e.g. a telco customer), retraining on original data is recommended for explainability, for increased acceptance, or simply because regulation requires it.
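One way to quantify how similar two sets of selected inputs are is the Jaccard index (size of the intersection divided by size of the union). The sketch below is an assumption about how such a comparison could be done, not the method used in the case study, and the variable names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of selected variable names."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0   # two empty selections are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical variable names; not taken from the telecom dataset.
selected_on_original = ["tenure", "monthly_bill", "support_calls", "contract_type"]
selected_on_synthetic = ["tenure", "monthly_bill", "support_calls", "data_usage"]

overlap = jaccard(selected_on_original, selected_on_synthetic)
print(overlap)
```

A value near 1 means that models trained on either dataset rely on largely the same inputs.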

Figure: AUC by Algorithm grouped by Method

Conclusions:

Models trained on synthetic data compared to the models trained on original data show highly similar performance
Models trained on anonymized data with ‘classic anonymization techniques’ show inferior performance compared to models trained on the original data or synthetic data
Synthetic data generation is easy and fast because the technique works exactly the same per dataset and per data type.

Value-adding synthetic data use cases

Use case 1: Synthetic data for model development and advanced analytics

Having a strong data foundation with easy and fast access to usable, high-quality data is essential to developing models (e.g. dashboards [BI] and advanced analytics [AI & ML]). However, many organizations suffer from a suboptimal data foundation resulting in 3 key challenges:

Getting access to data takes ages due to (privacy) regulations, internal processes, or data silos
Classic anonymization techniques destroy data, making the data no longer suitable for analysis and advanced analytics (garbage in = garbage out)
Existing solutions are not scalable because they work differently per dataset and per data type and cannot handle large multi-table databases

Synthetic data approach: develop models with as-good-as-real synthetic data to:

Minimize the use of original data, without hindering your developers
Unlock personal data and have access to more data that was previously restricted (e.g. due to privacy)
Easy and fast data access to relevant data
Scalable solution that works the same for each dataset, datatype, and for massive databases

This allows organizations to build a strong data foundation with easy and fast access to usable, high-quality data to unlock data and leverage data opportunities.

Use case 2: smart synthetic test data for software testing, development, and delivery

Testing and development with high-quality test data is essential to deliver state-of-the-art software solutions. Using original production data seems obvious, but is not allowed due to (privacy) regulations. Alternative Test Data Management (TDM) tools introduce “legacy-by-design” in getting the test data right:

They do not reflect production data, and business logic and referential integrity are not preserved
They are slow and time-consuming
They require manual work

Synthetic data approach: test and develop with AI-generated synthetic test data to deliver state-of-the-art software solutions, with:

Production-like data with preserved business logic and referential integrity
Easy and fast data generation with state-of-the-art AI
Privacy-by-design
Easy, fast, and agile

This allows organizations to test and develop with next-level test data to deliver state-of-the-art software solutions!

More information

Interested? For more information about synthetic data, visit the Syntho website or contact Wim Kees Janssen. For more information about SAS, visit www.sas.com or contact kees@syntho.ai.

In this use case, Syntho, SAS, and the NL AIC work together to achieve the intended results. Syntho is an expert in AI-generated synthetic data, and SAS is a market leader in analytics that offers software for exploring, analyzing, and visualizing data.
