Synthetic data preserves statistical properties
At Syntho we regularly get asked what exactly we mean when we say “synthetic data has the same statistical properties as the original data”. In this blog we answer that question and highlight some key statistics from the quality report that comes with each synthetic dataset we deliver.
Synthetic data?
Synthetic data consists of completely new, artificially generated data. Real individuals, and attributes tied to real individuals, simply do not exist in synthetic data. While it does not contain any personally identifiable information (PII), synthetic data preserves the properties and structure of the original dataset. Essentially, the idea is that you can use synthetic data as though it were real data, with the one exception that real individuals cannot be identified.
Synthetic data by Syntho has two key attributes
- It is impossible to identify and reverse-engineer individuals
Synthetic data is completely new and artificially generated data that is guaranteed anonymous. Since the data is fully synthetic, real individuals cannot be identified using this data.
- Syntho preserves the statistical properties and structure of the original data
The Syntho Engine captures all the relevant properties, structures and statistics of the original data. Hence, one experiences similar data utility with the synthetic data as with the original data.
Introduction to the case study
This case study presents highlights from our quality report, which compares various statistics of synthetic data generated by our Syntho Engine with those of the original data. We work with the publicly available income dataset from the Census Income Database (1994). This dataset contains various personal attributes (age, education, relationship etc.) enriched with income-related data (hours-per-week, income etc.). The columns of this original dataset, together with three example records, are shown in the table below.
id | 1 | 2 | 3 |
age | 39 | 50 | 38 |
workclass | State-gov | Self-emp-not-inc | Private |
fnlwgt | 77516 | 83311 | 215646 |
education | Bachelors | Bachelors | HS-grad |
education-num | 13 | 13 | 9 |
marital-status | Never-married | Married-civ-spouse | Divorced |
occupation | Adm-clerical | Exec-managerial | Handlers-cleaners |
relationship | Not-in-family | Husband | Not-in-family |
race | White | White | White |
sex | Male | Male | Male |
capital-gain | 2174 | 0 | 0 |
capital-loss | 0 | 0 | 0 |
hours-per-week | 40 | 13 | 40 |
native-country | United-States | United-States | United-States |
income | <=50K | <=50K | <=50K |
Syntho’s quality report
Syntho generates a quality report for every generated synthetic dataset. Our quality report contains various common statistics such as averages and distributions, enriched with more advanced statistics, such as correlations and multivariate distributions.
This case study offers a sneak preview of the quality report, since the full statistical report covers tens of pages of descriptive statistics. The full version of the quality report for the case study with the Census Income Database (1994) is available upon request via the form at the bottom of this page.
Syntho preserves descriptive statistics
Part of the quality report is a table containing summary statistics of the original dataset in comparison to the synthetic dataset generated by Syntho. This table contains descriptive statistics such as averages, standard deviations, minima, maxima, and correlations. As can be seen in the table below, summary statistics for original data (left) and synthetic data (right) are nearly identical. The synthetic count is shown as ∞ because we are able to tailor the amount of generated data to the goals of your use case.
statistic | original: age | original: capital-loss | synthetic: age | synthetic: capital-loss |
count | 32,561 | 32,561 | ∞ | ∞ |
mean | 38.58 | 87.30 | 38.59 | 86.96 |
standard deviation | 13.64 | 402.96 | 13.65 | 402.17 |
min | 17 | 0 | 17 | 0 |
25% | 28 | 0 | 28 | 0 |
50% | 37 | 0 | 37 | 0 |
75% | 48 | 0 | 48 | 0 |
max | 90 | 4,356 | 90 | 4,356 |
correlation (age vs. capital-loss) | 0.056726 | | 0.057775 | |
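A comparison like the one in this table can be reproduced with a few lines of code. The sketch below is only an illustration under our own assumptions, not the implementation behind the quality report: the helper names `summarize` and `max_relative_gap` are hypothetical, and the two columns are random toy stand-ins rather than the Census data.

```python
import random
import statistics


def summarize(values):
    """Summary statistics for one numeric column (mean, std, min, quartiles, max)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(values),
    }


def max_relative_gap(original, synthetic):
    """Largest relative difference between matching statistics of two columns."""
    a, b = summarize(original), summarize(synthetic)
    # Divide by 1.0 when the original statistic is 0 (e.g. a minimum of 0).
    return max(abs(a[k] - b[k]) / (abs(a[k]) or 1.0) for k in a)


# Toy stand-ins for an original and a synthetic "age" column (hypothetical data).
rng = random.Random(0)
original_age = [rng.randint(17, 90) for _ in range(5000)]
synthetic_age = [rng.randint(17, 90) for _ in range(5000)]

gap = max_relative_gap(original_age, synthetic_age)
print(f"largest relative gap between summary statistics: {gap:.3f}")
```

A small gap means the two columns agree on every summary statistic at once, which is what the table shows for age and capital-loss.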
Syntho preserves univariate distributions
Univariate distributions give insight into how often each value occurs within a single category (column). When displayed in a graph, one can observe the frequency of every value in that category, which shows the shape of the distribution. As can be seen in the following graphs, the distributions for original data (grey) and synthetic data (blue) are nearly identical for the categories age (continuous variable) and workclass (categorical variable).
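To put a number on how close two univariate distributions are, one option is the total variation distance between their relative frequencies. This is our own choice of metric for illustration; the quality report may use a different one. The function names and the toy workclass counts below are hypothetical.

```python
from collections import Counter


def relative_frequencies(values):
    """Share of records per distinct value."""
    counts = Counter(values)
    total = len(values)
    return {category: n / total for category, n in counts.items()}


def total_variation_distance(original, synthetic):
    """Half the summed absolute difference in shares: 0 = identical, 1 = disjoint."""
    p = relative_frequencies(original)
    q = relative_frequencies(synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical toy columns, not the real Census frequencies.
original_workclass = ["Private"] * 70 + ["Self-emp-not-inc"] * 20 + ["State-gov"] * 10
synthetic_workclass = ["Private"] * 68 + ["Self-emp-not-inc"] * 21 + ["State-gov"] * 11

tvd = total_variation_distance(original_workclass, synthetic_workclass)
print(f"total variation distance: {tvd:.3f}")
```

For a continuous variable such as age, the same function can be applied after grouping the values into bins.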
Syntho preserves correlations
Correlations provide insight into the degree to which two variables are related. Displayed in a matrix, one can easily observe the correlation for each pair of variables. As can be seen in the following matrices, the matrix for original data (grey) and the matrix for synthetic data (blue) are nearly identical for all combinations of categories from the Census Income Database (1994).
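The comparison of two correlation matrices can be sketched as follows. We assume Pearson correlation on numeric columns here, which the text above does not specify, and the column values are made-up toy numbers. The helper names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Correlation for every pair of columns, keyed by (name_a, name_b)."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


def max_correlation_gap(original, synthetic):
    """Largest absolute difference between matching matrix entries."""
    m_orig = correlation_matrix(original)
    m_synth = correlation_matrix(synthetic)
    return max(abs(m_orig[k] - m_synth[k]) for k in m_orig)


# Hypothetical toy columns; a real comparison would use all numeric columns.
original = {
    "age": [25, 32, 47, 51, 62],
    "education-num": [9, 13, 10, 14, 9],
    "hours-per-week": [40, 45, 50, 38, 20],
}
synthetic = {
    "age": [26, 31, 45, 53, 60],
    "education-num": [9, 13, 11, 14, 9],
    "hours-per-week": [40, 44, 50, 37, 22],
}

print(f"largest correlation gap: {max_correlation_gap(original, synthetic):.3f}")
```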
Syntho preserves multivariate distributions and correlations
Where univariate distributions describe single categories and correlations describe pairs of categories, multivariate distributions provide insight into combinations of categories. These are also captured by the Syntho Engine.
As an example, we show the bivariate distribution, which gives the frequency of each combination of values for two given categories. When displayed in a matrix, one can observe the frequency of every possible combination of values for the two categories. As can be seen in the following matrices, the matrix for original data (grey) and the matrix for synthetic data (blue) are nearly identical for all possible combinations of values for the two categories age and education.
Since the number of possible category combinations grows very quickly, multivariate correlations and distributions result in numerous graphs for potential analysis. For that reason, this shortened version of the quality report only shows the bivariate distribution between age and education.
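A bivariate distribution of this kind is essentially a table of joint frequencies. The sketch below builds one for age (grouped into ten-year bands, our own assumption) and education, and reports the largest difference between original and synthetic cells. The original records are the three example rows from the table earlier in this post; the synthetic records and all function names are hypothetical.

```python
from collections import Counter


def age_band(age, width=10):
    """Group an age into a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def joint_frequencies(records):
    """Share of records per (age band, education) combination."""
    counts = Counter((age_band(r["age"]), r["education"]) for r in records)
    return {cell: n / len(records) for cell, n in counts.items()}


def largest_cell_gap(original, synthetic):
    """Largest absolute difference in share over all combinations."""
    p, q = joint_frequencies(original), joint_frequencies(synthetic)
    return max(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in set(p) | set(q))


# The three example records shown earlier (age and education only).
original = [
    {"age": 39, "education": "Bachelors"},
    {"age": 50, "education": "Bachelors"},
    {"age": 38, "education": "HS-grad"},
]
# Hypothetical synthetic records, invented for this illustration.
synthetic = [
    {"age": 37, "education": "Bachelors"},
    {"age": 52, "education": "Bachelors"},
    {"age": 35, "education": "HS-grad"},
    {"age": 41, "education": "HS-grad"},
]

print(f"largest cell gap: {largest_cell_gap(original, synthetic):.2f}")
```

With only a handful of records the gap is naturally large; on a full dataset the cell shares become stable enough to compare meaningfully.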
Syntho preserves deep ‘hidden’ relations
The Syntho Engine also captures deeper, ‘hidden’ relations that go beyond multivariate correlations and bivariate distributions. We evaluate these with a built-in mechanism in our Syntho Engine that measures the accuracy of a machine learning model trained to distinguish between real and synthetic data. Experiments show that even machine learning models can barely distinguish original data from synthetic data. These outcomes are not only one of the interesting elements in our quality report; they are also the key performance measure that Syntho uses to optimize the Syntho Engine.
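The idea of such a real-versus-synthetic test can be sketched in a few lines. This is not the mechanism built into the Syntho Engine, whose model is not described here: we assume a very simple nearest-centroid classifier as a stand-in, and the data are random toy samples. An accuracy close to 0.5 means the classifier does no better than guessing, so the two datasets are hard to tell apart.

```python
import random


def centroid(rows):
    """Column-wise mean of a list of numeric rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def discriminator_accuracy(real, synthetic, seed=0):
    """Train a nearest-centroid classifier to separate real (0) from
    synthetic (1) rows and return its accuracy on a held-out half."""
    rng = random.Random(seed)
    labelled = [(row, 0) for row in real] + [(row, 1) for row in synthetic]
    rng.shuffle(labelled)
    half = len(labelled) // 2
    train, test = labelled[:half], labelled[half:]
    real_centre = centroid([row for row, label in train if label == 0])
    synth_centre = centroid([row for row, label in train if label == 1])
    correct = 0
    for row, label in test:
        closer_to_synth = squared_distance(row, synth_centre) < squared_distance(row, real_centre)
        correct += int(closer_to_synth) == label
    return correct / len(test)


def sample(n, mean_age, rng):
    """Hypothetical (age, hours-per-week) rows drawn from normal distributions."""
    return [(rng.gauss(mean_age, 13), rng.gauss(40, 12)) for _ in range(n)]


rng = random.Random(1)
real = sample(1000, 38.6, rng)
faithful = sample(1000, 38.6, rng)    # same distribution as the real rows
distorted = sample(1000, 100.0, rng)  # deliberately wrong age distribution

faithful_score = discriminator_accuracy(real, faithful)
distorted_score = discriminator_accuracy(real, distorted)
print(f"faithful synthetic data:  {faithful_score:.2f}")
print(f"distorted synthetic data: {distorted_score:.2f}")
```

A stronger classifier would pick up subtler differences than this stand-in can, which is why the choice of model matters in practice.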
Because these deep ‘hidden’ relations are retained in the synthetic data, it can be used as a full replacement for the original data in machine learning tasks. As an example, take the Census dataset again and suppose you want to predict the income of an individual from their other characteristics. If we train two machine learning models, one on the original data and one on the synthetic data, and test both on the same separately held-out test set, we observe an accuracy loss of less than 2%. This shows that synthetic data retains statistical properties to such an extent that, even in machine learning tasks, it provides a viable alternative to real data.
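The experiment described above follows a simple pattern: train one model on original data, train the same kind of model on synthetic data, and score both on the same held-out real records. The sketch below illustrates that pattern only. The data generator, the nearest-centroid model and all names are our own assumptions for the example, and the "synthetic" rows are simply a second draw from the same toy distribution, so the numbers say nothing about the Census results.

```python
import random


def make_rows(n, rng):
    """Hypothetical rows: (education-num, hours-per-week) -> high-income flag."""
    rows = []
    for _ in range(n):
        education_num = rng.randint(1, 16)
        hours = rng.gauss(40, 10)
        score = 3 * education_num + 0.5 * hours + rng.gauss(0, 5)
        rows.append(((education_num, hours), int(score > 48)))
    return rows


def fit_centroids(rows):
    """Nearest-centroid 'model': the mean feature vector of each class."""
    centroids = {}
    for label in (0, 1):
        members = [features for features, y in rows if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def predict(centroids, features):
    def distance(label):
        return sum((x - c) ** 2 for x, c in zip(features, centroids[label]))
    return min(centroids, key=distance)


def accuracy(centroids, rows):
    return sum(predict(centroids, f) == y for f, y in rows) / len(rows)


rng = random.Random(42)
original_train = make_rows(2000, rng)
synthetic_train = make_rows(2000, rng)  # stand-in for a generator's output
real_test = make_rows(1000, rng)        # held-out real records

accuracy_original = accuracy(fit_centroids(original_train), real_test)
accuracy_synthetic = accuracy(fit_centroids(synthetic_train), real_test)
accuracy_loss = accuracy_original - accuracy_synthetic
print(f"trained on original:  {accuracy_original:.3f}")
print(f"trained on synthetic: {accuracy_synthetic:.3f}")
print(f"accuracy loss:        {accuracy_loss:+.3f}")
```

Scoring both models on the same real test set is the important design choice: it measures how well a model trained on synthetic data performs on the data it will actually face.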
Explore the added value of synthetic data with Syntho
The best way to get familiar with the potential added value of synthetic data is to work with a synthetic version of your own original dataset. To make this possible, Syntho offers a pilot. This pilot results in a fully synthetic version of your original dataset, supplied with a quality report and a tailor-made dashboard, so that you can fully explore the added value of synthetic data with us.
Key pilot deliverables
- Fully synthetic and anonymous dataset that preserves the value of your original dataset
- Quality report demonstrating the statistical similarity of your synthetic dataset to the original
- Hands-on experience with synthetic data as a best practice for privacy-by-design
MEET SYNTHO
Leave your contact details and we will get back to you at the speed of light!