AI-generated Synthetic Data, easy and fast access to high quality data?
AI-generated synthetic data in practice
Syntho, an expert in AI-generated synthetic data, aims to turn privacy by design into a competitive advantage. The company helps organizations build a strong data foundation with easy and fast access to high-quality data, and recently won the Philips Innovation Award.
However, synthetic data generation with AI is a relatively new solution that typically raises a number of recurring questions. To answer these, Syntho started a case study together with SAS, a market leader in advanced analytics and AI software.
In collaboration with the Dutch AI Coalition (NL AIC), they investigated the value of synthetic data by comparing synthetic data generated by the Syntho Engine with the original data in assessments of data quality, legal validity and usability.
Is data anonymization not a solution?
Classic anonymization techniques have in common that they manipulate the original data to make it harder to trace records back to individuals. Examples are generalization, suppression, wiping, pseudonymization, data masking, and shuffling of rows and columns. You can find examples in the table below.
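To make these techniques concrete, here is a minimal sketch of four of them applied to a toy customer table. The records, field names, salt and helper functions are all hypothetical and chosen for illustration only; they are not the techniques or settings SAS used in the case study.

```python
import hashlib
import random

# Hypothetical toy records; the field names are illustrative only.
customers = [
    {"name": "Alice", "age": 34, "zip": "1012AB", "churned": 0},
    {"name": "Bob", "age": 37, "zip": "1012CD", "churned": 1},
    {"name": "Carol", "age": 52, "zip": "3511EF", "churned": 0},
]

def generalize_age(age, width=10):
    """Generalization: replace an exact age with a coarse band."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def suppress_zip(zip_code, keep=2):
    """Suppression / masking: keep only the first characters."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def pseudonymize(value, salt="demo-salt"):
    """Pseudonymization: replace an identifier with a salted hash token."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:10]

def shuffle_column(rows, column, seed=0):
    """Shuffling: permute one column across rows, breaking its link to the rest."""
    values = [row[column] for row in rows]
    random.Random(seed).shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]

anonymized = [
    {
        "name": pseudonymize(row["name"]),
        "age": generalize_age(row["age"]),
        "zip": suppress_zip(row["zip"]),
        "churned": row["churned"],
    }
    for row in customers
]
anonymized = shuffle_column(anonymized, "zip")

for row in anonymized:
    print(row)
```

Note that every output row is still derived from exactly one input row, and that each step throws away detail (exact ages, full postcodes, the link between postcode and customer).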
Those techniques introduce 3 key challenges:
- They work differently per data type and per dataset, making them hard to scale. Furthermore, since they work differently, there will always be debate about which methods to apply and what combination of techniques is needed.
- There is always a one-to-one relationship with the original data. This means that there will always be a privacy risk, especially given the many open datasets and the techniques available to link them.
- They manipulate data and thereby destroy data in the process. This is especially devastating for AI tasks where “predictive power” is essential, because poor-quality data will result in poor insights from the AI model (garbage in, garbage out).
These points are also assessed in this case study.
An introduction to the case study
For the case study, the target dataset was a telecom dataset provided by SAS containing the data of 56,600 customers. The dataset contains 128 columns, including one column indicating whether a customer has left the company (i.e. ‘churned’) or not. The goal of the case study was to use the synthetic data to train models that predict customer churn and to evaluate the performance of those trained models. As churn prediction is a classification task, SAS selected four popular classification models to make the predictions:
- Random forest
- Gradient boosting
- Logistic regression
- Neural network
Before generating the synthetic data, SAS randomly split the telecom dataset into a train set (for training the models) and a holdout set (for scoring the models). Having a separate holdout set for scoring allows for an unbiased assessment of how well the classification model might perform when applied to new data.
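A random train/holdout split of this kind can be sketched as follows. The holdout fraction and the seed are assumptions made for illustration; the article does not state which proportions SAS used, and the integer list is only a stand-in for the customer records.

```python
import random

def split_train_holdout(rows, holdout_fraction=0.3, seed=42):
    """Randomly partition rows into a train set and a holdout set.

    holdout_fraction and seed are illustrative assumptions, not values
    taken from the case study.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    holdout = shuffled[:n_holdout]
    train = shuffled[n_holdout:]
    return train, holdout

# Stand-in for the 56,600 customer records.
rows = list(range(56600))
train, holdout = split_train_holdout(rows)
print(len(train), len(holdout))
```

Because the holdout rows are set aside before any synthetic or anonymized data is produced, none of the later training datasets can leak information about them.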
Using the train set as input, Syntho used its Syntho Engine to generate a synthetic dataset. For benchmarking, SAS also created a manipulated version of the train set by applying various anonymization techniques until a certain threshold (of k-anonymity) was reached. These steps resulted in four datasets:
- A train dataset (i.e. the original dataset minus the holdout dataset)
- A holdout dataset (i.e. a subset of the original dataset)
- An anonymized dataset (based on the train dataset)
- A synthetic dataset (based on the train dataset)
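The k-anonymity threshold mentioned above can be checked with a few lines of code: a table is k-anonymous when every combination of quasi-identifier values is shared by at least k rows. The sketch below uses hypothetical records and quasi-identifier columns; the article does not say which columns or which value of k were used in the case study.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    quasi-identifier values (0 for an empty table)."""
    groups = Counter(
        tuple(row[column] for column in quasi_identifiers) for row in rows
    )
    return min(groups.values()) if groups else 0

# Hypothetical, already generalized records.
records = [
    {"age": "30-39", "zip": "10****", "churned": 0},
    {"age": "30-39", "zip": "10****", "churned": 1},
    {"age": "50-59", "zip": "35****", "churned": 0},
    {"age": "50-59", "zip": "35****", "churned": 1},
    {"age": "50-59", "zip": "35****", "churned": 0},
]

k = k_anonymity(records, ["age", "zip"])
print(f"The table is {k}-anonymous on (age, zip)")
```

Raising k usually means generalizing or suppressing more values, which is exactly where the loss of data quality comes from.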
The train, anonymized and synthetic datasets were each used to train every classification model, resulting in 12 (3 x 4) trained models. SAS subsequently used the holdout dataset to measure the accuracy with which each model predicts customer churn. The results are presented below, starting with some basic statistics.
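The experimental design can be written down as a small grid. This sketch only enumerates the twelve train/score combinations described above; the dictionary keys are hypothetical names, and the actual training was done in SAS Visual Data Mining and Machine Learning, not with this code.

```python
from itertools import product

training_sets = ["train (original)", "anonymized", "synthetic"]
algorithms = [
    "random forest",
    "gradient boosting",
    "logistic regression",
    "neural network",
]

# Every model is trained on one of the three datasets and always scored
# on the same holdout set of original records.
experiments = [
    {"train_on": dataset, "algorithm": algorithm, "score_on": "holdout"}
    for dataset, algorithm in product(training_sets, algorithms)
]

for experiment in experiments:
    print(experiment)
```

Scoring all twelve models on the same holdout set is what makes the results comparable across the three training datasets.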
Figure: Machine Learning pipeline generated in SAS Visual Data Mining and Machine Learning
Basic statistics when comparing anonymized data to original data
Anonymization techniques destroy even basic patterns, business logic, relationships and statistics (as in the example below). Using anonymized data for basic analytics thus produces unreliable results. In fact, the poor quality of the anonymized data made it almost impossible to use it for advanced analytics tasks (e.g. AI/ML modelling and dashboarding).
Basic statistics when comparing synthetic data with original data
Synthetic data generation with AI preserves basic patterns, business logic, relationships and statistics (as in the example below). Using synthetic data for basic analytics thus produces reliable results. The key question: does synthetic data also hold up for advanced analytics tasks (e.g. AI/ML modelling and dashboarding)?
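One simple way to check whether basic statistics are preserved is to compare summary statistics column by column. The sketch below does this for a single numeric column; the numbers are made up for illustration and are not results from the case study, and the choice of mean and standard deviation as the statistics to compare is an assumption.

```python
from statistics import mean, stdev

def basic_stats(values):
    """Summary statistics for one numeric column."""
    return {"mean": mean(values), "stdev": stdev(values)}

def relative_gaps(original, candidate):
    """Relative difference per statistic between a candidate column
    (synthetic or anonymized) and the original column.

    Assumes the original statistics are non-zero.
    """
    original_stats = basic_stats(original)
    candidate_stats = basic_stats(candidate)
    return {
        name: abs(candidate_stats[name] - original_stats[name])
        / abs(original_stats[name])
        for name in original_stats
    }

# Made-up values for one numeric column.
original_column = [20, 35, 50, 65, 80]
candidate_column = [22, 33, 51, 64, 79]

gaps = relative_gaps(original_column, candidate_column)
for name, gap in gaps.items():
    print(f"{name}: {gap:.1%} relative difference")
```

Small gaps on such per-column statistics are necessary but not sufficient: they say nothing yet about relationships between columns, which is why the model-based comparison that follows matters.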
AI-generated synthetic data and advanced analytics
Synthetic data not only preserves basic patterns (as shown in the plots above), it also captures the deep ‘hidden’ statistical patterns required for advanced analytics tasks. The latter is demonstrated in the bar chart below, which indicates that the accuracy of models trained on synthetic data is similar to that of models trained on original data. Furthermore, with an area under the curve (AUC*) close to 0.5, the models trained on anonymized data perform by far the worst. The full report with all advanced analytics assessments on synthetic data in comparison with the original data is available on request.
*AUC: the area under the curve is a measure of the accuracy of advanced analytics models, taking into account true positives, false positives, false negatives and true negatives. A value of 0.5 means that a model predicts randomly and has no predictive power; a value of 1 means that the model is always correct and has full predictive power.
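For readers who want to see what the AUC measures, it can be computed as the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner, with ties counted as half. The following is a minimal pairwise sketch of that definition with made-up labels and scores; it is not the implementation used by SAS.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for churned, 0 for not churned.
    scores: predicted churn scores, higher means more likely to churn.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # churner ranked above non-churner
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))

# Made-up holdout labels and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

A model that gives every customer the same score ends up at exactly 0.5, which is why an AUC close to 0.5 indicates no predictive power.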
Additionally, this synthetic data can be used to understand the data characteristics and the main variables needed for the actual training of the models. The inputs selected by the algorithms on synthetic data were very similar to those selected on original data. Hence, the modeling process can be done on the synthetic version, which reduces the risk of data breaches. However, when making predictions about individual records (e.g. a telco customer), retraining on the original data is recommended for explainability, increased acceptance or simply because regulation requires it.
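One way to quantify how similar the selected inputs are is the Jaccard similarity between the two sets of selected variables. This is an assumed measure chosen for illustration; the article does not state how the similarity was assessed, and the variable names below are hypothetical.

```python
def jaccard_similarity(first, second):
    """Share of selected inputs that the two selections have in common."""
    first, second = set(first), set(second)
    if not first and not second:
        return 1.0
    return len(first & second) / len(first | second)

# Hypothetical variable names, not taken from the telecom dataset.
selected_on_original = ["tenure", "monthly_charges", "contract_type"]
selected_on_synthetic = ["tenure", "monthly_charges", "support_calls"]

overlap = jaccard_similarity(selected_on_original, selected_on_synthetic)
print(f"Jaccard similarity of selected inputs: {overlap:.2f}")
```

A value of 1 means both models selected exactly the same inputs; a value of 0 means they share none.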
AUC by Algorithm grouped by Method
- Models trained on synthetic data compared to the models trained on original data show highly similar performance
- Models trained on anonymized data with ‘classic anonymization techniques’ show inferior performance compared to models trained on the original data or synthetic data
- Synthetic data generation is easy and fast because the technique works exactly the same per dataset and per data type.
Value-adding synthetic data use cases
Use case 1: Synthetic data for model development and advanced analytics
Having a strong data foundation with easy and fast access to usable, high quality data is essential to develop models (e.g. dashboards [BI] and advanced analytics [AI & ML]). However, many organizations suffer from a suboptimal data foundation resulting in 3 key challenges:
- Getting access to data takes ages due to (privacy) regulations, internal processes or data silos
- Classic anonymization techniques destroy data, making the data no longer suitable for analysis and advanced analytics (garbage in = garbage out)
- Existing solutions are not scalable because they work differently per dataset and per data type and cannot handle large multi-table databases
Synthetic data approach: develop models with as-good-as-real synthetic data to:
- Minimize the use of original data, without hindering your developers
- Unlock personal data and have access to more data that was previously restricted (e.g. due to privacy)
- Get easy and fast access to relevant data
- Use a scalable solution that works the same for each dataset, each data type and for massive databases
This allows organizations to build a strong data foundation with easy and fast access to usable, high quality data to unlock data and to leverage data opportunities.
Use case 2: smart synthetic test data for software testing, development and delivery
Testing and development with high quality test data is essential to deliver state-of-the-art software solutions. Using original production data seems obvious, but is not allowed due to (privacy) regulations. Alternative Test Data Management (TDM) tools introduce “legacy-by-design” in getting the test data right. They:
- Do not reflect production data, and do not preserve business logic or referential integrity
- Work slowly and are time-consuming
- Require manual work
Synthetic data approach: test and develop with AI-generated synthetic test data to deliver state-of-the-art software solutions with:
- Production-like data with preserved business logic and referential integrity
- Easy and fast data generation with state-of-the-art AI
- An easy, fast and agile workflow
This allows organizations to test and develop with next-level test data to deliver state-of-the-art software solutions!
Interested? For more information about synthetic data, visit the Syntho website or contact Wim Kees Janssen. For more information about SAS, visit www.sas.com or contact email@example.com.
In this use case, Syntho, SAS and the NL AIC work together to achieve the intended results. Syntho is an expert in AI-generated synthetic data and SAS is a market leader in analytics and offers software for exploring, analyzing and visualizing data.