For the case study, the target dataset was a telecom dataset provided by SAS containing the data of 56,600 customers. The dataset contains 128 columns, including one column indicating whether a customer has left the company (i.e. ‘churned’) or not. The goal of the case study was to train several models on the synthetic data to predict customer churn and to assess the performance of each model. As churn prediction is a classification task, SAS selected five popular classification models to make the predictions:
- Decision tree
- Random forest
- Gradient boosting
- Logistic regression
- Neural network
Before generating the synthetic data, SAS randomly split the telecom dataset into a train set (for training the models) and a holdout set (for scoring the models). Having a separate holdout set for scoring allows for an unbiased assessment of how well the classification model might do when applied to new data.
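The random split described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure SAS used: the function name `split_train_holdout`, the 30% holdout fraction, the seed and the toy `customers` records are all assumptions, since the case study does not state the split ratio or the tooling.

```python
import random

def split_train_holdout(rows, holdout_fraction=0.3, seed=42):
    """Randomly partition rows into a train set and a holdout set.

    holdout_fraction and seed are illustrative assumptions.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)          # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    # The first n_holdout shuffled rows form the holdout set, the rest the train set.
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Toy stand-in for the telecom data (hypothetical records).
customers = [{"customer_id": i, "churned": i % 5 == 0} for i in range(1000)]
train, holdout = split_train_holdout(customers)
```

Because every row lands in exactly one of the two sets, the holdout rows are never seen during training or during synthetic data generation, which is what makes the later scoring unbiased.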
Syntho then used its Syntho Engine to generate a synthetic dataset from the train set. For benchmarking, SAS also created an anonymized version of the train set by applying various anonymization techniques until a certain k-anonymity threshold was reached. These steps resulted in four datasets:
- A train dataset (i.e. the original dataset minus the holdout dataset)
- A holdout dataset (i.e. a subset of the original dataset)
- An anonymized dataset (based on the original dataset minus the holdout dataset)
- A synthetic dataset (based on the original dataset minus the holdout dataset)
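For readers unfamiliar with the k-anonymity threshold mentioned for the anonymized dataset: a dataset is k-anonymous if every combination of quasi-identifier values is shared by at least k rows. The sketch below measures k for a small table. The function name `k_anonymity`, the column names and the example rows are hypothetical; the case study does not say which columns were treated as quasi-identifiers or which value of k was targeted.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share
    the same values on all quasi-identifier columns."""
    groups = Counter(
        tuple(row[column] for column in quasi_identifiers) for row in rows
    )
    return min(groups.values())

# Hypothetical, already-generalized records (age banded, region coarsened).
rows = [
    {"age_band": "30-39", "region": "North", "churned": False},
    {"age_band": "30-39", "region": "North", "churned": True},
    {"age_band": "40-49", "region": "South", "churned": False},
    {"age_band": "40-49", "region": "South", "churned": False},
    {"age_band": "40-49", "region": "South", "churned": True},
]

k = k_anonymity(rows, ["age_band", "region"])  # smallest group has 2 rows
```

Anonymization techniques such as generalization or suppression are typically applied repeatedly until this value reaches the desired threshold, usually at the cost of detail in the data.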
The train, anonymized and synthetic datasets were each used to train every classification model, resulting in 15 (3 × 5) trained models. SAS subsequently used the holdout dataset to measure the accuracy with which each model predicts customer churn. The results are presented below, starting with some basic statistics.
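The evaluation design described here (three training datasets, five model types, one shared holdout set) can be sketched as a simple loop. This is an assumption-laden illustration and not the SAS implementation: `MajorityClass` is a hypothetical stand-in that always predicts the most common training label, used for all five model names so the example stays self-contained, and the toy feature and label lists are invented. In practice each factory would return a real classifier.

```python
from collections import Counter

class MajorityClass:
    """Hypothetical stand-in model: predicts the most common training label."""
    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def evaluate_grid(train_sets, model_factories, X_holdout, y_holdout):
    """Train every model on every training set and score each
    trained model on the same holdout set."""
    results = {}
    for data_name, (X, y) in train_sets.items():
        for model_name, make_model in model_factories.items():
            model = make_model().fit(X, y)
            predictions = model.predict(X_holdout)
            results[(data_name, model_name)] = accuracy(y_holdout, predictions)
    return results

# Invented toy data: one feature per row, label 1 = churned.
toy = ([[0], [1], [2], [3]], [0, 0, 0, 1])
train_sets = {"train": toy, "anonymized": toy, "synthetic": toy}
model_names = ["decision tree", "random forest", "gradient boosting",
               "logistic regression", "neural network"]
model_factories = {name: MajorityClass for name in model_names}  # placeholders

results = evaluate_grid(train_sets, model_factories, [[4], [5], [6], [7]], [0, 0, 0, 1])
```

The `results` dictionary holds one accuracy score per (dataset, model) pair, 15 in total. Scoring every model on the same holdout set is what makes the scores comparable across the three training datasets.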