Syntho develops software for AI-generated synthetic data. Instead of using sensitive original data, customers use our AI software to create high-quality synthetic data. We generate entirely new data, but we model those new datapoints so that they preserve the characteristics, relationships and statistical patterns of the original data to such an extent that the synthetic data can be used as if it were the original data. Syntho provides a quality report with every generated synthetic dataset to demonstrate this. The quality report contains various basic statistics, including aggregates, distributions and correlations, enriched with more advanced measures, such as multivariate distributions. This blog presents highlights from our data quality report, based on a case study with the publicly available Census Income Database (1994).
This case study demonstrates highlights from our quality report, which compares various statistics of synthetic data generated by our Syntho Engine with those of the original data. We work with the publicly available income dataset from the Census Income Database (1994). This dataset contains various personal attributes (age, education, relationship, etc.) enriched with income-related data (hours-per-week, income, etc.). The table below lists the categories of this original dataset together with example data records.
The original Census Income Database (1994)
Example data records
id | 1 | 2 | 3 |
age | 39 | 50 | 38 |
workclass | State-gov | Self-emp-not-inc | Private |
fnlwgt | 77516 | 83311 | 215646 |
education | Bachelors | Bachelors | HS-grad |
education-num | 13 | 13 | 9 |
marital-status | Never-married | Married-civ-spouse | Divorced |
occupation | Adm-clerical | Exec-managerial | Handlers-cleaners |
relationship | Not-in-family | Husband | Not-in-family |
race | White | White | White |
sex | Male | Male | Male |
capital-gain | 2174 | 0 | 0 |
capital-loss | 0 | 0 | 0 |
hours-per-week | 40 | 13 | 40 |
native-country | United-States | United-States | United-States |
income | <=50K | <=50K | <=50K |
Syntho generates a quality report for every generated synthetic dataset. Our quality report contains various common statistics such as averages and distributions, enriched with more advanced statistics, such as correlations and multivariate distributions.
This case study offers a sneak preview of our quality report, since the full statistical report covers many pages of descriptive statistics. The full version of the quality report for the case study with the Census Income Database (1994) is available upon request via the form at the bottom of this page.
This first part of our quality report covers summary statistics of the original dataset in comparison to the synthetic dataset generated by Syntho. This table contains descriptive statistics such as averages, standard deviations, minima, maxima, and correlations. As can be seen in the table below, summary statistics for original data (left) and synthetic data (right) are nearly identical.
Also note that we can tailor the amount of generated data (count) to the goals of your use case. Although we advise keeping the number of datapoints (count, N) similar to the original for statistical analysis purposes, we can generate an unlimited amount of synthetic data.
Descriptive statistics
Original data: left – synthetic data: right
statistic | age (original) | capital-loss (original) | age (synthetic) | capital-loss (synthetic) |
count | 32.561 | 32.561 | ∞ | ∞ |
mean | 38,58 | 87,30 | 38,59 | 86,96 |
standard deviation | 13,64 | 402,96 | 13,65 | 402,17 |
min | 17 | 0 | 17 | 0 |
25% | 28 | 0 | 28 | 0 |
50% | 37 | 0 | 37 | 0 |
75% | 48 | 0 | 48 | 0 |
max | 90 | 4.356 | 90 | 4.356 |
Correlation between age and capital-loss: 0,056726 (original data), 0,057775 (synthetic data)
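As a rough illustration of how such a side-by-side comparison could be reproduced, the sketch below computes the same descriptive statistics for one column of two datasets and reports the gap per statistic. It uses only Python's standard library. The helper names `describe` and `compare` and the toy age values are assumptions made for this example; they are not part of the Syntho Engine or its quality report.

```python
import statistics


def describe(values):
    """Return count, mean, sample standard deviation, min, quartiles and max."""
    # method="inclusive" interpolates linearly between the observed values.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }


def compare(original, synthetic):
    """Absolute gap per statistic; count is skipped because it may differ freely."""
    a, b = describe(original), describe(synthetic)
    return {name: abs(a[name] - b[name]) for name in a if name != "count"}


# Toy values for illustration only (hypothetical, not the full Census column).
original_age = [39, 50, 38, 53, 28, 37, 49, 52]
synthetic_age = [40, 49, 38, 52, 29, 37, 48, 53]

for name, gap in compare(original_age, synthetic_age).items():
    print(f"{name:>4}: {gap:.2f}")
```

Small gaps on every row would correspond to the nearly identical original and synthetic columns in the table above.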
Univariate distributions give insight into how often each value or category occurs within a single column. When displayed in a graph, one can observe the frequency of all data records for a certain value, which reveals the shape of the distribution. As can be seen in the following graphs, the distributions for original data (grey) and synthetic data (blue) are nearly identical, both for continuous variables and for categorical variables.
Distributions give insight into the frequency of a certain data record for a given category or value and are captured by the Syntho Engine.
Original data: grey – synthetic data: blue
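To put a number on how closely two univariate distributions overlap, one option is the total variation distance between the relative frequencies of a categorical column: 0 means identical distributions, 1 means no overlap at all. This metric is an assumption chosen for illustration; the quality report itself shows overlaid graphs. The function names and the toy `workclass` values below are hypothetical.

```python
from collections import Counter


def relative_frequencies(values):
    """Share of records per category."""
    counts = Counter(values)
    total = len(values)
    return {category: n / total for category, n in counts.items()}


def total_variation_distance(original, synthetic):
    """Half the summed absolute difference between two frequency tables."""
    p = relative_frequencies(original)
    q = relative_frequencies(synthetic)
    categories = set(p) | set(q)  # include categories missing on either side
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Toy workclass columns for illustration only (hypothetical values).
original = ["Private"] * 6 + ["State-gov"] * 2 + ["Self-emp-not-inc"] * 2
synthetic = ["Private"] * 5 + ["State-gov"] * 3 + ["Self-emp-not-inc"] * 2

print(f"total variation distance: {total_variation_distance(original, synthetic):.3f}")
```

For continuous columns such as age, the same function can be applied after binning the values into intervals.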
Correlations provide insight into the degree to which two variables are related. Displayed in a matrix, one can easily observe the correlation for each combination of variables. As can be seen in the following illustration, the matrix for original data (grey) and the matrix for synthetic data (blue) are nearly identical for all combinations of categories from the Census Income Database (1994).
Correlations provide insight into the degree to which two variables are related and are captured by the Syntho Engine.
Original data: left in grey – synthetic data: right in blue
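A correlation matrix comparison of this kind can be sketched in a few lines. The example below assumes Pearson correlation on numeric columns (the report does not state which correlation measure it uses) and summarizes the two matrices by their largest cell-wise difference. All function names and the toy column values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; assumes neither column is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Map every (column, column) pair to its correlation."""
    return {(a, b): pearson(columns[a], columns[b]) for a in columns for b in columns}


def max_abs_difference(original, synthetic):
    """Largest cell-wise gap between the two correlation matrices."""
    m_orig = correlation_matrix(original)
    m_syn = correlation_matrix(synthetic)
    return max(abs(m_orig[pair] - m_syn[pair]) for pair in m_orig)


# Toy numeric columns for illustration only (hypothetical values).
original = {
    "age": [39, 50, 38, 53, 28],
    "hours-per-week": [40, 13, 40, 40, 40],
    "education-num": [13, 13, 9, 7, 13],
}
synthetic = {
    "age": [41, 48, 37, 52, 29],
    "hours-per-week": [40, 15, 38, 40, 42],
    "education-num": [13, 12, 9, 8, 13],
}

print(f"largest correlation gap: {max_abs_difference(original, synthetic):.3f}")
```

A gap close to zero means the two matrices would look alike when drawn next to each other.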
Where univariate distributions describe a single category and correlations summarize the relationship between two categories in a single number, multivariate distributions provide insight into the joint behaviour of combinations of categories. These are also captured by the Syntho Engine.
As an example, we show the result for a bivariate distribution, which gives insight into the frequency of each combination of data records for two given categories. When displayed in a matrix, one can observe the frequency of all possible combinations of data records for the two categories. As can be seen in the following matrices, the matrix for original data (grey) and the matrix for synthetic data (blue) are nearly identical for all possible combinations of data records for the categories age and education.
Multivariate distributions and correlations provide insight into combinations of categories and are also captured by the Syntho Engine.
Original data: left in grey – synthetic data: right in blue
Since multivariate correlations and distributions can be calculated for each combination of categories, this exercise will result in many graphs. As illustrated, we only provide the bivariate distribution between age and education in this shortened version of the quality report and are more than happy to share our full version of this Census Income Database (1994) case study.
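A bivariate distribution of this kind is essentially a table of joint relative frequencies. The sketch below builds such a table for age (grouped into ten-year bands) and education, and then reports the largest difference between any two matching cells. The banding width, the helper names and the toy records are assumptions made for this example only.

```python
from collections import Counter


def age_band(age, width=10):
    """Group an age into a band label such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def joint_frequencies(xs, ys):
    """Relative frequency of every observed (x, y) combination."""
    pairs = Counter(zip(xs, ys))
    total = len(xs)
    return {pair: n / total for pair, n in pairs.items()}


def largest_cell_gap(p, q):
    """Largest absolute difference between matching cells of two tables."""
    cells = set(p) | set(q)  # a cell missing on one side counts as frequency 0
    return max(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cells)


# Toy records for illustration only (hypothetical values).
original_age = [39, 50, 38, 53, 28, 37]
original_edu = ["Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors", "Masters"]
synthetic_age = [37, 52, 39, 51, 27, 36]
synthetic_edu = ["Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors", "HS-grad"]

p = joint_frequencies([age_band(a) for a in original_age], original_edu)
q = joint_frequencies([age_band(a) for a in synthetic_age], synthetic_edu)
print(f"largest cell gap: {largest_cell_gap(p, q):.3f}")
```

Repeating this for every pair of categories yields one matrix per pair, which is why the full report contains many graphs.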
The Syntho Engine also captures deeper, ‘hidden’ relations beyond multivariate correlations and bivariate distributions. We evaluate these with a built-in mechanism in our Syntho Engine that measures the accuracy of a machine learning model trained to distinguish between real and synthetic data. Experiments show that even machine learning models can barely distinguish original data from synthetic data. These outcomes are not only one of the interesting elements in our quality report; they are also the key performance measure that Syntho uses to optimize the Syntho Engine.
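The idea behind such a discriminator test can be sketched as follows: label real rows 0 and synthetic rows 1, train a classifier on half of the mixed rows, and measure its accuracy on the other half. An accuracy close to 0.5 means the classifier does no better than guessing. The sketch is not Syntho's built-in mechanism; it assumes a simple 1-nearest-neighbour classifier on two numeric features, and the data is randomly generated for illustration.

```python
import random


def nearest_neighbour_label(point, train):
    """Label of the training row closest to `point` (squared Euclidean distance)."""
    closest = min(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], point)))
    return closest[1]


def discriminator_accuracy(real, synthetic, seed=0):
    """Train on half of the mixed rows and return accuracy on the other half."""
    rows = [(r, 0) for r in real] + [(s, 1) for s in synthetic]
    random.Random(seed).shuffle(rows)
    half = len(rows) // 2
    train, test = rows[:half], rows[half:]
    hits = sum(nearest_neighbour_label(features, train) == label for features, label in test)
    return hits / len(test)


rng = random.Random(42)


def sample(n, shift=0.0):
    """Hypothetical two-feature rows drawn from a normal distribution."""
    return [(rng.gauss(shift, 1), rng.gauss(shift, 1)) for _ in range(n)]


real = sample(200)
good_synthetic = sample(200)             # same distribution as the real rows
poor_synthetic = sample(200, shift=5.0)  # clearly shifted, easy to tell apart

# Expected to be close to 0.5: the classifier cannot separate the two sets.
print(discriminator_accuracy(real, good_synthetic))
# Expected to be close to 1.0: the shifted rows are trivially recognisable.
print(discriminator_accuracy(real, poor_synthetic))
```

In practice a stronger classifier would be used, since a weak model failing to separate the two sets says little about how similar they really are.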
Because the deep ‘hidden’ relations of the data are retained within the synthetic data, it can be used as a replacement for the original data in machine learning tasks. As an example, we use the Census dataset again: suppose you want to predict the income of an individual given their other characteristics. If we train two machine learning models, one on the original data and the other on the synthetic data, and test both on a separate hold-out test set, we observe an accuracy loss of less than 2%. This shows that synthetic data retains statistical properties to such an extent that, even in machine learning tasks, it provides a viable alternative to real data.
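The comparison described above can be sketched with a deliberately simple model. The example below fits a one-feature threshold rule (a decision stump) that predicts income above 50K from `education-num`, once on original rows and once on synthetic rows, and scores both on the same hold-out set of real rows. The model choice, the function names and all values are assumptions for illustration; the source does not state which model was used in the experiment.

```python
def accuracy(threshold, xs, ys):
    """Share of rows where the rule 'predict 1 when x >= threshold' is right."""
    return sum((x >= threshold) == bool(y) for x, y in zip(xs, ys)) / len(xs)


def fit_stump(xs, ys):
    """Pick the observed value that works best as a threshold on the training rows."""
    best_threshold, best_accuracy = None, -1.0
    for candidate in sorted(set(xs)):
        score = accuracy(candidate, xs, ys)
        if score > best_accuracy:
            best_threshold, best_accuracy = candidate, score
    return best_threshold


# Toy rows for illustration only (hypothetical): education-num and income > 50K (1) or not (0).
original_x, original_y = [9, 9, 10, 13, 13, 14, 16, 7], [0, 0, 0, 1, 0, 1, 1, 0]
synthetic_x, synthetic_y = [9, 10, 10, 12, 14, 14, 15, 8], [0, 0, 0, 1, 1, 1, 1, 0]
holdout_x, holdout_y = [9, 11, 13, 14, 10, 16, 12, 12], [0, 0, 1, 1, 0, 1, 0, 1]

# Both models are scored on the same hold-out rows of real data.
acc_original = accuracy(fit_stump(original_x, original_y), holdout_x, holdout_y)
acc_synthetic = accuracy(fit_stump(synthetic_x, synthetic_y), holdout_x, holdout_y)

print(f"trained on original:  {acc_original:.3f}")
print(f"trained on synthetic: {acc_synthetic:.3f}")
print(f"accuracy loss:        {acc_original - acc_synthetic:.3f}")
```

The key design point is that the hold-out set contains only real records, so the score reflects how well a model trained on synthetic data performs on the data it will actually face.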
We aim for maximum data quality. Therefore, in addition to our own quality report, data analytics experts from SAS (market leader in analytics) regularly evaluate synthetic datasets generated by Syntho through various analytics assessments (Artificial Intelligence, Business Intelligence, modeling, algorithm training, etc.). We do this because we want to give you insight into our superior data quality not only from the point of view of the generator of the synthetic data (Syntho), but also from an external and objective point of view (in this case SAS).
Those results will be available on request. Additionally, we will host a webinar with SAS, where the analytics experts will share their results with you.
SAS: www.sas.com