Evaluation of our synthetic data by SAS data experts

Our synthetic data is assessed and approved by the data experts of SAS

Conclusions by the data experts of SAS

Syntho’s synthetic data has been rigorously evaluated and approved by SAS data experts, affirming its accuracy and usability.

Synthetic vs. Original Performance

Models trained on synthetic data perform very similarly to models trained on the original data.

Anonymized Data Performance Gap

Models trained on data anonymized with ‘classic anonymization techniques’ perform worse than models trained on either the original data or the synthetic data.

Fast Synthetic Data Generation

Synthetic data generation is easy and fast because the technique works the same way for every dataset and every data type.

Initial results of the data assessment by SAS

Models trained on synthetic data score very similarly to models trained on original data

The AI algorithm learns patterns and relationships from real-world data to generate new, synthetic data that mimics these characteristics closely. This synthetic data is so accurate that it can be used for advanced analytics, acting as a “synthetic data twin” that functions like real-world data.

Why do models trained on anonymized data score worse?

Classic anonymization techniques all manipulate the original data to make it harder to trace records back to individuals, and in doing so they destroy information. The more you anonymize, the better your data is protected, but the more of it is destroyed.
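The trade-off can be made concrete with a minimal sketch. It assumes generalization (replacing exact values with bands) as the technique; the source does not state which classic techniques SAS applied, and the function name, band width, and ages below are illustrative only.

```python
def generalize_age(age, band_width=10):
    """Replace an exact age with a band such as '20-29' (hypothetical helper)."""
    lower = (age // band_width) * band_width
    return f"{lower}-{lower + band_width - 1}"

# Illustrative ages, not taken from the telecom dataset in the assessment.
ages = [23, 27, 31, 38, 45, 52]
banded = [generalize_age(a) for a in ages]

# Six distinct values collapse into four bands: individuals are harder to
# single out, but a model can no longer tell a 23-year-old from a 27-year-old.
distinct_before = len(set(ages))
distinct_after = len(set(banded))
```

Widening the band protects individuals further and removes still more of the signal a model could have learned from.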

This is especially devastating for AI and modeling tasks where “predictive power” is essential, because poor-quality data leads to poor insights from the AI model. SAS demonstrated this: the models trained on anonymized data reached an area under the curve (AUC) close to 0.5, the score expected from random guessing, and performed by far the worst.
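To see why an AUC near 0.5 signals a model with no predictive power, here is a minimal sketch of the metric in its rank-based form. The labels and scores are made up for illustration and are not results from the SAS assessment.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive example scores higher than a randomly chosen negative
    one. Ties count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 1, 0, 0, 0]  # 1 = churned, 0 = stayed (illustrative)

# Every churner is scored above every non-churner.
perfect = auc(labels, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1])

# A constant score carries no information about who churns.
uninformative = auc(labels, [0.5] * 6)
```

Here `perfect` evaluates to 1.0 and `uninformative` to 0.5, which is why an AUC close to 0.5 means the model ranks customers no better than chance.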

What did SAS do during this assessment?

Synthetic data generated by Syntho is assessed, validated and approved from an external and objective point of view by the data experts of SAS.

Telecom Data as Target

We used telecom data for “churn” prediction, focusing on how synthetic data could be utilized to train models and assess their performance.

Model Selection

SAS selected popular classification models for the prediction:
Random forest
Gradient boosting
Logistic regression
Neural network

Data Splitting

Before generating synthetic data, the telecom dataset was randomly split into:

Train Set: Used for training the models.
Holdout Set: Used for unbiased model scoring.
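A random train/holdout split of this kind can be sketched as follows. The holdout fraction, the seed, and the function name are assumptions for illustration; the source does not state the ratio SAS used.

```python
import random

def split_train_holdout(rows, holdout_fraction=0.25, seed=42):
    """Shuffle the rows with a fixed seed and set aside a holdout portion.

    holdout_fraction and seed are illustrative defaults, not values
    reported in the assessment.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    holdout = shuffled[:n_holdout]
    train = shuffled[n_holdout:]
    return train, holdout

# Stand-in for customer records; each integer represents one row.
customers = list(range(100))
train, holdout = split_train_holdout(customers)
```

Splitting before any synthetic or anonymized data is generated keeps the holdout rows unseen by every generator and every model, which is what makes the later scoring unbiased.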

Generating Synthetic and Anonymized Data

Syntho generated a synthetic dataset using the train set. Additionally, SAS created an anonymized dataset using the same data, resulting in four datasets:

Original Train Dataset
Holdout Dataset
Anonymized Dataset
Synthetic Dataset

Model Training

Each dataset (original, anonymized, and synthetic) was used to train the churn prediction models. This resulted in a total of 12 trained models (3 datasets x 4 models). The models were trained using their respective datasets to evaluate how well they could predict churn outcomes. After training, the models’ accuracy was assessed using the holdout dataset to ensure unbiased performance evaluation across all models and datasets.
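The experiment design described above can be laid out as a simple grid. This is only a sketch of the bookkeeping, with hypothetical field names; it does not train any model.

```python
from itertools import product

TRAINING_DATASETS = ["original", "anonymized", "synthetic"]
MODEL_TYPES = [
    "random forest",
    "gradient boosting",
    "logistic regression",
    "neural network",
]

# One experiment per (dataset, model) pair. Every experiment is scored on
# the same holdout set, so differences in score come from the training data.
experiments = [
    {"train_on": dataset, "model": model, "score_on": "holdout"}
    for dataset, model in product(TRAINING_DATASETS, MODEL_TYPES)
]
```

Holding the model types and the scoring set fixed means the training dataset is the only thing that varies between comparable runs.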

Model Performance Evaluation

SAS evaluated the accuracy of each model using the holdout dataset, measuring the predictive performance of customer churn. They also conducted detailed evaluations of data accuracy, privacy protection, and usability, concluding that Syntho’s synthetic data was accurate, secure, and usable compared to the original data.

Additional results of synthetic data assessments by SAS

Synthetic data generated by Syntho is assessed, validated and approved from an external and objective point of view by the data experts of SAS.

Correlations and relationships

The correlations and relationships between variables were accurately preserved in synthetic data.
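One way to check such a claim is to compare pairwise correlations in the two datasets. The sketch below uses Pearson correlation as an assumption, since the source does not say which measure SAS used; the column names and values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)

def max_correlation_gap(original, synthetic, columns):
    """Largest absolute difference in pairwise correlation between datasets."""
    gaps = []
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            r_original = pearson(original[a], original[b])
            r_synthetic = pearson(synthetic[a], synthetic[b])
            gaps.append(abs(r_original - r_synthetic))
    return max(gaps)

# Hypothetical columns and values, for illustration only.
original = {"tenure": [1, 2, 3, 4, 5, 6], "monthly_charge": [70, 65, 66, 50, 52, 40]}
synthetic = {"tenure": [2, 1, 4, 3, 6, 5], "monthly_charge": [68, 71, 52, 63, 41, 50]}

gap = max_correlation_gap(original, synthetic, ["tenure", "monthly_charge"])
```

A gap near zero indicates that the synthetic data reproduces the strength and direction of the relationships found in the original data.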

Area Under the Curve (AUC)

The Area Under the Curve (AUC), a metric for measuring model performance, remained consistent.

Variable importance

Variable importance, which indicates the predictive power of each variable in a model, remained intact when comparing synthetic data to the original dataset.
