Synthetic Data Quality Report

Syntho develops software for AI-generated synthetic data. Instead of using sensitive original data, customers use our AI software to create high-quality synthetic data. We generate entirely new data, but we model the new data points so that they preserve the characteristics, relationships and statistical patterns of the original data to such an extent that the synthetic data can be used as if it were original data. To demonstrate this, Syntho offers a quality report for every generated synthetic dataset. Our quality report contains various basic statistics, including aggregates, distributions and correlations, enriched with more advanced measures, such as multivariate distributions. This blog illustrates highlights from our data quality report, based on a case study with the publicly available Census Income Database (1994).

Synthetic data case study: the publicly available Census Income Database (1994)

This case study demonstrates highlights from our quality report, which compares various statistics of synthetic data generated by our Syntho Engine with those of the original data. We work with the publicly available income dataset from the Census Income Database (1994). This dataset contains various personal attributes (age, education, relationship etc.) enriched with income-related data (hours-per-week, income etc.). The categories of this original dataset, together with example data records, are shown in the table below.

The original Census Income Database (1994)

Example data records

id                1                2                     3
age               39               50                    38
workclass         State-gov        Self-emp-not-inc      Private
fnlwgt            77516            83311                 215646
education         Bachelors        Bachelors             HS-grad
education-num     13               13                    9
marital-status    Never-married    Married-civ-spouse    Divorced
occupation        Adm-clerical     Exec-managerial       Handlers-cleaners
relationship      Not-in-family    Husband               Not-in-family
race              White            White                 White
sex               Male             Male                  Male
capital-gain      2174             0                     0
capital-loss      0                0                     0
hours-per-week    40               13                    40
native-country    United-States    United-States         United-States
income            <=50K            <=50K                 <=50K

An introduction to our synthetic data quality report

Syntho generates a quality report for every generated synthetic dataset. Our quality report contains various common statistics such as averages and distributions, enriched with more advanced statistics, such as correlations and multivariate distributions.

 

This case study offers a sneak preview of our quality report; the full statistical report covers many pages of descriptive statistics. The full version of the quality report for the case study with the Census Income Database (1994) is available upon request via the form at the bottom of this page.

Synthetic Data quality report: descriptive statistics

This first part of our quality report compares summary statistics of the original dataset with those of the synthetic dataset generated by Syntho: averages, standard deviations, minima, maxima, and correlations. As can be seen in the table below, the summary statistics for original data (left) and synthetic data (right) are nearly identical.

Also note that we can tailor the amount of generated data (count) to the goals of your use case. Although we advise keeping the number of data points (count, N) similar to the original for statistical analysis purposes, we can generate an unlimited amount of synthetic data.

Descriptive statistics

Original data: left – synthetic data: right

 

                                    Original data             Synthetic data
                                    age       capital-loss    age       capital-loss
count                               32,561    32,561          32,561    32,561
mean                                38.58     87.30           38.59     86.96
standard deviation                  13.64     402.96          13.65     402.17
min                                 17        0               17        0
25%                                 28        0               28        0
50%                                 37        0               37        0
75%                                 48        0               48        0
max                                 90        4,356           90        4,356
correlation (age vs capital-loss)   0.056726                  0.057775
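The comparison in the table above can be reproduced in a few lines. The following is a minimal sketch using only the Python standard library; it is not the Syntho Engine's implementation. The function names `describe` and `compare` are hypothetical, and the values are toy numbers rather than Census records.

```python
import statistics

def describe(values):
    """Summary statistics for one numeric column (same rows as the table)."""
    # "inclusive" treats the data as the full population range for quartiles.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(values),
    }

def compare(original, synthetic):
    """Absolute difference per statistic between two columns."""
    stats_original = describe(original)
    stats_synthetic = describe(synthetic)
    return {
        name: abs(stats_original[name] - stats_synthetic[name])
        for name in stats_original
    }

# Toy values for illustration only (assumption: not real Census data).
original_age = [39, 50, 38, 53, 28, 37, 49, 52]
synthetic_age = [38, 51, 39, 52, 29, 36, 50, 51]

for name, difference in compare(original_age, synthetic_age).items():
    print(f"{name:>5}: {difference:.2f}")
```

Small differences across all rows indicate that the synthetic column reproduces the basic shape of the original column; the quartile convention is a choice and other tools may interpolate differently.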

Synthetic Data quality report: univariate distributions

Univariate distributions give insight into the frequency of data records for a given category or value. Displayed in a graph, they show how often each value occurs, which reveals the shape of the distribution. As can be seen in the following graphs, the distributions for original data (grey) and synthetic data (blue) are nearly identical for the categories weight, birthdate and height (continuous variables) and bats (categorical variable).

Univariate distributions

Distributions give insight into the frequency of data records for a given category or value and are captured by the Syntho Engine.

Original data: grey – synthetic data: blue

[Figures: univariate distributions of bats, birth year, height and weight, showing that Syntho preserves distributions in the generated synthetic data]
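Besides inspecting the graphs, the closeness of two univariate distributions can be summarized in a single number. The sketch below uses the total variation distance between relative frequencies; this metric is an assumption chosen for illustration and is not stated to be part of the quality report. The function names are hypothetical and the records are toy values that reuse workclass labels from the example table.

```python
from collections import Counter

def relative_frequencies(values):
    """Share of records per category."""
    counts = Counter(values)
    total = len(values)
    return {category: n / total for category, n in counts.items()}

def total_variation_distance(original, synthetic):
    """0.0 means identical distributions, 1.0 means no overlap at all."""
    p = relative_frequencies(original)
    q = relative_frequencies(synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)

# Toy records for illustration only (assumption: not real Census data).
original_workclass = (
    ["Private"] * 6 + ["State-gov"] * 2 + ["Self-emp-not-inc"] * 2
)
synthetic_workclass = (
    ["Private"] * 5 + ["State-gov"] * 3 + ["Self-emp-not-inc"] * 2
)

distance = total_variation_distance(original_workclass, synthetic_workclass)
print(f"total variation distance: {distance:.3f}")
```

For continuous variables the values would first be grouped into bins, after which the same calculation applies to the bin labels.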

Synthetic Data quality report: correlations

Correlations provide insight into the degree to which two variables are related. Displayed in a matrix, the correlations for each combination of variables can be compared at a glance. As can be seen in the following illustration, the matrix for original data (grey) and the matrix for synthetic data (blue) are nearly identical for all combinations of categories from the Census Income Database (1994).

Correlations

Correlations provide insight into the degree to which two variables are related and are captured by the Syntho Engine.

Original data: left in grey – synthetic data: right in blue

[Figures: correlation matrix for the original data (left) and for the generated synthetic data (right), showing that Syntho preserves correlations in the generated synthetic data]
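The matrix comparison described above can also be expressed numerically. The following sketch computes Pearson correlations for every pair of numeric columns and reports the largest absolute difference between the two matrices. It assumes Pearson correlation on numeric columns only; the report may use other association measures, especially for categorical variables. All names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric columns.

    Raises ZeroDivisionError if either column is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(variance_x * variance_y)

def correlation_matrix(columns):
    """Map every (column, column) pair to its correlation."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

def max_abs_difference(original, synthetic):
    """Largest cell-wise gap between the two correlation matrices."""
    matrix_original = correlation_matrix(original)
    matrix_synthetic = correlation_matrix(synthetic)
    return max(abs(matrix_original[k] - matrix_synthetic[k]) for k in matrix_original)

# Toy columns for illustration only (assumption: not real Census data).
original = {
    "age": [39, 50, 38, 53, 28],
    "education-num": [13, 13, 9, 7, 13],
    "hours-per-week": [40, 13, 40, 40, 45],
}
synthetic = {
    "age": [38, 51, 37, 54, 29],
    "education-num": [13, 12, 9, 7, 13],
    "hours-per-week": [40, 15, 40, 38, 45],
}

print(f"largest correlation gap: {max_abs_difference(original, synthetic):.3f}")
```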

Synthetic Data quality report: multivariate distributions and correlations

Where univariate distributions and correlations provide insight into the distributions and relationships of individual categories, multivariate distributions provide insight into combinations of categories and are also captured by the Syntho Engine.

As an example, we show the bivariate distribution, which gives insight into the frequency of combinations of data records for two given categories. Displayed in a matrix, it shows the frequency of every possible combination of values for the two categories. As can be seen in the following matrices, the matrix for original data (grey) and the matrix for synthetic data (blue) are nearly identical for all possible combinations of data records for the categories [height – birthdate] and [weight – height].

Multivariate distributions

Multivariate distributions and correlations provide insight into combinations of categories and are also captured by the Syntho Engine.

Original data: left in grey – synthetic data: right in blue

[Figures: bivariate distributions for birthdate and height, and for weight and height]

Since multivariate correlations and distributions can be calculated for each combination of categories, this exercise will result in many graphs. As illustrated, we only provide the bivariate distribution between age and education in this shortened version of the quality report and are more than happy to share our full version of this Census Income Database (1994) case study.
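A bivariate distribution of the kind described in this section is a table of joint relative frequencies. The sketch below builds such a table for two columns and reports the largest difference in any cell between original and synthetic data. The ten-year age bands, the function names and the records are assumptions made for illustration.

```python
from collections import Counter

def age_band(age, width=10):
    """Bin a continuous value so it can be cross-tabulated (39 becomes 30)."""
    return age // width * width

def joint_frequencies(col_a, col_b):
    """Relative frequency of every observed (a, b) combination."""
    pairs = Counter(zip(col_a, col_b))
    total = sum(pairs.values())
    return {pair: n / total for pair, n in pairs.items()}

def largest_cell_gap(original, synthetic):
    """Largest absolute difference over all cells of the two tables."""
    cells = set(original) | set(synthetic)
    return max(abs(original.get(c, 0.0) - synthetic.get(c, 0.0)) for c in cells)

# Toy records for illustration only (assumption: not real Census data).
original_age = [39, 50, 38, 53, 28, 37]
original_education = ["Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors", "Masters"]
synthetic_age = [37, 52, 36, 55, 29, 31]
synthetic_education = ["Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors", "HS-grad"]

table_original = joint_frequencies(
    [age_band(a) for a in original_age], original_education
)
table_synthetic = joint_frequencies(
    [age_band(a) for a in synthetic_age], synthetic_education
)

print(f"largest cell gap: {largest_cell_gap(table_original, table_synthetic):.3f}")
```

Because the number of cells grows with every added category, a full report repeats this for each pair of categories, which is why the complete version runs to many graphs.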

Synthetic Data quality report: deep ‘hidden’ relations

The Syntho Engine also captures deeper, ‘hidden’ relations beyond multivariate correlations and bivariate distributions. We evaluate these with a built-in mechanism in the Syntho Engine that measures the accuracy of a machine learning model trained to distinguish between real and synthetic data. Experiments show that even machine learning models can barely distinguish original data from synthetic data. These outcomes are not only one of the interesting elements of our quality report; they are also the key performance measure that Syntho uses to optimize the Syntho Engine.
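The idea of a model that tries to tell real from synthetic records can be sketched as follows. This example uses a deliberately simple nearest-centroid classifier as a stand-in; the model actually used inside the Syntho Engine is not described here, and all names and values below are hypothetical. An accuracy close to 0.5 means the classifier does no better than guessing, while an accuracy close to 1.0 means the synthetic data is easy to recognize.

```python
def centroid(rows):
    """Column-wise mean of a list of numeric rows."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(rows, test_fraction):
    """Hold out the last rows as a test set (assumes rows are already shuffled)."""
    cut = len(rows) - max(1, int(len(rows) * test_fraction))
    return rows[:cut], rows[cut:]

def discriminator_accuracy(real_rows, synthetic_rows, test_fraction=0.25):
    """Train on labelled real/synthetic rows, score on held-out rows."""
    real_train, real_test = split(real_rows, test_fraction)
    synthetic_train, synthetic_test = split(synthetic_rows, test_fraction)
    centre_real = centroid(real_train)
    centre_synthetic = centroid(synthetic_train)

    def predict(row):
        # Ties are resolved in favour of "real".
        if squared_distance(row, centre_synthetic) < squared_distance(row, centre_real):
            return "synthetic"
        return "real"

    labelled = [(row, "real") for row in real_test]
    labelled += [(row, "synthetic") for row in synthetic_test]
    correct = sum(predict(row) == label for row, label in labelled)
    return correct / len(labelled)

# Toy rows for illustration only (assumption: not real Census data).
real = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
poor_synthetic = [[10.0, 10.0], [11.0, 11.0], [10.0, 11.0], [11.0, 10.0]]

print(discriminator_accuracy(real, poor_synthetic))  # clearly separable data
print(discriminator_accuracy(real, real))            # indistinguishable data
```

In practice a stronger classifier and a randomized split would be used; the structure of the test stays the same.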

As all deep ‘hidden’ relations of the data are retained within the synthetic data, it can be used as a full replacement for the original data in machine learning tasks. As an example, we use the Census dataset again: say you want to predict the income of an individual given their other characteristics. If we train two machine learning models, one on the original data and the other on the synthetic data, and test both on a separate held-out test set, we observe an accuracy loss of less than 2%. This shows that synthetic data retains statistical properties to such an extent that, even in machine learning tasks, it provides a superior alternative to real data.
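The experiment described in this paragraph follows a pattern often called "train on synthetic, test on real". The sketch below illustrates the pattern with a nearest-centroid classifier and two features; the classifier, the feature choice, the function names and the records are all assumptions, and the result says nothing about the figures reported for the Census case study.

```python
def fit_centroids(rows, labels):
    """One mean vector per class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in groups.items()
    }

def predict(centroids, row):
    """Label of the closest class centroid."""
    return min(
        centroids,
        key=lambda label: sum((x - c) ** 2 for x, c in zip(row, centroids[label])),
    )

def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, row) == label for row, label in zip(rows, labels))
    return hits / len(rows)

def accuracy_gap(original, synthetic, holdout):
    """Each argument is a (rows, labels) pair; the holdout contains real data."""
    accuracy_original = accuracy(fit_centroids(*original), *holdout)
    accuracy_synthetic = accuracy(fit_centroids(*synthetic), *holdout)
    return accuracy_original, accuracy_synthetic, accuracy_original - accuracy_synthetic

# Toy rows of [education-num, hours-per-week] (assumption: not real Census data).
labels = ["<=50K"] * 3 + [">50K"] * 3
original = ([[9, 35], [10, 40], [9, 38], [14, 50], [13, 55], [15, 45]], labels)
synthetic = ([[9, 36], [10, 39], [8, 40], [14, 52], [13, 50], [15, 48]], labels)
holdout = ([[9, 40], [10, 30], [14, 60], [13, 45]], ["<=50K", "<=50K", ">50K", ">50K"])

trained_on_original, trained_on_synthetic, gap = accuracy_gap(original, synthetic, holdout)
print(f"original: {trained_on_original:.2f}, synthetic: {trained_on_synthetic:.2f}, gap: {gap:.2f}")
```

The key design point is that both models are scored on the same held-out real records, so the gap isolates the effect of training on synthetic instead of original data.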

External references

Analytics experts from SAS assess our synthetic data

We aim for maximal data quality. Hence, next to our own quality report, data analytics experts from SAS (market leader in analytics) regularly evaluate synthetic datasets generated by Syntho through various analytics assessments (Artificial Intelligence, Business Intelligence, modeling, algorithm training etc.). We do this because we would like to give you insight into our superior data quality, not only from the point of view of the generator of synthetic data (Syntho), but also from an external and objective point of view (in this case SAS).

Those results will be available on request. Additionally, we will host a webinar with SAS, where the analytics experts will share their results with you.

SAS: www.sas.com 

The truth about synthetic data revealed

Interested?

Request the full quality report version of this Census Income Database (1994) case study!
