What is synthetic data?

A crash course in synthetic data

 

 

Introduction

What is synthetic data?

The answer is relatively simple. Whereas original data is collected through your interactions with real people (e.g. clients, patients and employees) and through your internal processes, synthetic data is generated by a computer algorithm. This algorithm generates completely new, artificial datapoints.

Solve data privacy challenges

Synthetically generated data consists of completely new and artificial datapoints with no one-to-one relationship to the original data. Hence, none of the synthetic datapoints can be traced back or reverse engineered to original data. As a result, synthetic data is exempt from privacy regulations such as the GDPR and serves as a solution to data-privacy challenges.

Augment and simulate

The generative nature of synthetic data generation allows you to augment and simulate completely new data. This helps when you do not have enough data (data scarcity), would like to up-sample edge cases or do not have data yet.

Here, the focus of Syntho is structured data (data formatted in tables with rows and columns, as you would see in an Excel sheet), but we like to illustrate the concept of synthetic data with images because they are more appealing.

Types of synthetic data

Three types of synthetic data exist under the synthetic data umbrella: dummy data, rule-based generated synthetic data and synthetic data generated by artificial intelligence (AI). We briefly explain each of the three types below.

Dummy data / mock data

Dummy data is randomly generated data (e.g. by a mock data generator).

Consequently, the characteristics, relationships and statistical patterns in the original data are not preserved, captured or reproduced in the generated dummy data. Hence, dummy data / mock data is only minimally representative of the original data.

  • When to use it: to replace direct identifiers (PII) or when you do not have data (yet) and do not want to spend time and energy on defining rules.
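As a minimal sketch of what a mock data generator does, the Python snippet below fills a small table with random values. The column names, value ranges and name pool are hypothetical and chosen only for illustration. Every column is drawn independently, which is why relationships between columns in the original data (for example between age and income) do not carry over.

```python
import random


def make_dummy_rows(n, seed=0):
    """Generate n rows of random mock data; every column is drawn independently."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    first_names = ["Alex", "Sam", "Robin", "Kim"]  # hypothetical value pool
    rows = []
    for i in range(n):
        rows.append({
            "customer_id": i + 1,
            "name": rng.choice(first_names),
            "age": rng.randint(18, 90),              # hypothetical range
            "income": rng.randint(20_000, 120_000),  # hypothetical range
        })
    return rows


rows = make_dummy_rows(5)
for row in rows:
    print(row)
```

Because nothing is learned from original data, this approach needs no input data at all, which matches the "when you do not have data (yet)" use case.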

Rule-based generated synthetic data

Rule-based generated synthetic data is synthetic data generated by a pre-defined set of rules. Examples of such rules are that the synthetic data should have a certain minimum value, maximum value or average value. Any characteristics, relationships and statistical patterns that you would like to have reproduced in the rule-based generated synthetic data need to be pre-defined.

Consequently, the data quality will only be as good as the pre-defined set of rules. This results in challenges when high data quality is essential. First, one can define only a limited set of rules to be captured in the synthetic data. Second, setting up multiple rules will typically result in overlapping and conflicting rules. Third, you will never fully cover all relevant rules, and there might be relevant rules that you are not even aware of. Finally, this takes a lot of time and energy, which makes it an inefficient solution.

  • When to use it: when you do not have data (yet)
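The three example rules mentioned above (minimum, maximum and average) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular product: the function name `generate_by_rules` is hypothetical, and a triangular distribution is assumed simply because its mean is (minimum + maximum + mode) / 3, so the mode can be solved from the requested average.

```python
import random
import statistics


def generate_by_rules(n, minimum, maximum, average, seed=0):
    """Draw n values that respect a minimum, a maximum and a target average.

    Assumption: values follow a triangular distribution, whose mean equals
    (minimum + maximum + mode) / 3, so we solve for the mode.
    """
    mode = 3 * average - minimum - maximum
    if not minimum <= mode <= maximum:
        # The three rules cannot all hold for this distribution shape.
        raise ValueError("conflicting rules: average not reachable")
    rng = random.Random(seed)
    return [rng.triangular(minimum, maximum, mode) for _ in range(n)]


ages = generate_by_rules(10_000, minimum=18, maximum=90, average=45)
print(min(ages), max(ages), statistics.fmean(ages))
```

Even this small example shows how rules can conflict: asking for a range of 0 to 10 with an average of 9 raises an error, because the assumed distribution shape cannot satisfy all three rules at once.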

Synthetic data generated by artificial intelligence (AI)

As the name suggests, synthetic data generated by artificial intelligence (AI) is synthetic data generated by an AI algorithm. The AI model is trained on the original data to learn its characteristics, relationships and statistical patterns. Thereafter, the AI algorithm is able to generate completely new datapoints, modeled in such a way that they reproduce the characteristics, relationships and statistical patterns of the original dataset. This is what we call a synthetic data twin.

The AI model mimics original data to generate synthetic data twins that can be used as if they were original data. This unlocks various use cases in which AI-generated synthetic data can be used as an alternative to original (sensitive) data, for example as test data, demo data or for analytics.

A visualization of how synthetic data is created

In comparison to rule-based generated synthetic data: instead of you studying and defining relevant rules, the AI algorithm does this automatically for you. This covers not only the characteristics, relationships and statistical patterns that you are aware of, but also those that you are not even aware of.

  • When to use it: when you have (some) data as input to mimic or to use as a starting point for smart data generation and augmentation features
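The "learn from original data, then generate new datapoints" idea can be illustrated with a deliberately simple stand-in. The sketch below is not an AI model and not Syntho's algorithm: it assumes a two-column table, fits a bivariate normal distribution to it (means, standard deviations and correlation), and samples new rows from that fitted model. The "original" data here is made-up toy data, and all function and variable names are hypothetical.

```python
import math
import random
import statistics


def fit(rows):
    """Learn means, standard deviations and correlation from two-column rows."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / len(rows)
    return mx, my, sx, sy, cov / (sx * sy)


def sample(model, n, rng):
    """Generate n new rows that follow the learned bivariate normal model."""
    mx, my, sx, sy, rho = model
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mix z1 into y so that x and y keep the learned correlation rho.
        y = my + sy * (rho * z1 + math.sqrt(max(0.0, 1 - rho**2)) * z2)
        rows.append((x, y))
    return rows


rng = random.Random(42)
# Made-up toy "original" data: income loosely depends on age.
ages = [rng.uniform(20, 65) for _ in range(1000)]
original = [(age, 1000 * age + rng.gauss(0, 5000)) for age in ages]

model = fit(original)
synthetic = sample(model, 1000, rng)

rho_original = model[4]
rho_synthetic = fit(synthetic)[4]
print(rho_original, rho_synthetic)
```

None of the synthetic rows is a copy of an original row, yet the relationship between the two columns is carried over without anyone writing it down as a rule. A real AI-based generator would capture far richer patterns than this two-column toy model.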

What type of synthetic data to use?

Depending on your use case, a combination of dummy data / mock data, rule-based generated synthetic data and synthetic data generated by artificial intelligence (AI) is advised. This overview provides a first indication of which type of synthetic data to use. As Syntho supports all of them, feel free to contact our experts to take a deep dive into your use case.

This chart presents different types of synthetic data
