Why classic anonymization fails

What is classic anonymization?

By classic anonymization, we mean all techniques in which one manipulates or distorts an original dataset to make it harder to trace back individuals.

Typical examples of classic anonymization that we see in practice are generalization, suppression / wiping, pseudonymization and row and column shuffling.

Here are those techniques with corresponding examples.

Technique                | Original data  | Manipulated data
Generalization           | 27 years old   | Between 25 and 30 years old
Suppression / wiping     | info@syntho.ai | xxxx@xxxxxx.xx
Pseudonymization         | Amsterdam      | hVFD6td3jdHHj78ghdgrewui6
Row and column shuffling | Aligned        | Shuffled
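The techniques above can be sketched in a few lines of Python. This is a toy illustration only: the sample record, the bucket size, and the hashing salt are made-up assumptions, not part of any real anonymization product.

```python
import hashlib
import random

# A toy record (hypothetical values, for illustration only)
record = {"age": 27, "email": "info@syntho.ai", "city": "Amsterdam"}

def generalize_age(age, bucket=5):
    """Generalization: replace an exact age with an age range."""
    low = (age // bucket) * bucket
    return f"between {low} and {low + bucket} years old"

def suppress_email(email):
    """Suppression / wiping: mask the characters, keep only the format."""
    local, domain = email.split("@")
    name, _, tld = domain.rpartition(".")
    return "x" * len(local) + "@" + "x" * len(name) + "." + "x" * len(tld)

def pseudonymize(value, salt="demo-salt"):
    """Pseudonymization: replace a value with a consistent opaque token."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:24]

def shuffle_column(values, seed=0):
    """Row shuffling: break the alignment between a column and its rows."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled

anonymized = {
    "age": generalize_age(record["age"]),
    "email": suppress_email(record["email"]),
    "city": pseudonymize(record["city"]),
}
print(anonymized)
```

Note that each transformation destroys some information (utility) while still leaving a manipulated trace of the original value behind, which is exactly the trade-off discussed next.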

Manipulating a dataset with classic anonymization techniques results in two key disadvantages:

  1. Distorting a dataset results in decreased data quality (i.e. data utility).
  2. Privacy risk will be reduced, but will always be present.

We demonstrate these two key disadvantages, reduced data utility and residual privacy risk, with the following illustration, in which suppression and generalization are applied.

Note: we use images for illustrative purposes. The same principle holds for structured datasets containing rows and columns.

  • Left: light application of classic anonymization results in a representative illustration. However, the individual can easily be identified, and the privacy risk is significant.
  • Right: heavy application of classic anonymization results in strong privacy protection. However, the illustration becomes useless.

This introduces the trade-off between data utility and privacy protection, where classic anonymization techniques always offer a suboptimal combination of both. 

A frequent question we receive: is removing all the direct identifiers (such as names) from the dataset a solution?

No. This is a big misconception and does not result in anonymous data: the remaining quasi-identifiers (such as age, gender, and postal code) can often be combined to re-identify individuals. Do you still apply this as a way to anonymize your dataset? Then this blog is a must-read for you.
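To see why removing names is not enough, consider a classic linkage attack: joining a "de-identified" dataset with a public dataset on shared quasi-identifiers. All data below is fictitious and for illustration only.

```python
# A "de-identified" dataset: direct identifiers removed,
# quasi-identifiers (age, zip) kept. Fictitious data.
released = [
    {"age": 27, "zip": "1011", "diagnosis": "flu"},
    {"age": 45, "zip": "1071", "diagnosis": "asthma"},
]

# A public dataset an attacker may already have (e.g. a voter list)
public = [
    {"name": "Alice", "age": 27, "zip": "1011"},
    {"name": "Bob", "age": 45, "zip": "1071"},
]

# Linkage attack: join the two datasets on the shared quasi-identifiers.
re_identified = {}
for row in released:
    matches = [p for p in public
               if p["age"] == row["age"] and p["zip"] == row["zip"]]
    if len(matches) == 1:  # a unique match re-identifies the individual
        re_identified[matches[0]["name"]] = row["diagnosis"]

print(re_identified)  # {'Alice': 'flu', 'Bob': 'asthma'}
```

Even though no names were released, every record is re-identified because the quasi-identifier combination is unique.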

How is Synthetic Data different?

Synthetic data, also called artificial or fake data, is an entirely new dataset of artificially generated data records. Information to identify real individuals is simply not present in a synthetic dataset.

The key difference at Syntho: we apply machine learning. This means we are able to preserve the structure and properties of the original dataset and generate realistic synthetic data that can be used as if it were real data. As a result, one experiences similar data quality with the synthetic data as with the original dataset.
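The general idea behind model-based synthesis can be sketched in miniature. Note that this is only a conceptual toy, not Syntho's actual method: a real engine fits far richer machine-learning models that capture correlations between columns, whereas this sketch fits a single Gaussian to one fictitious column and samples new values from it.

```python
import random

# Original (fictitious) data: ages of real individuals
original_ages = [23, 27, 31, 35, 29, 41, 38, 26]

# Step 1: fit a simple statistical model; here just mean and std.
mean = sum(original_ages) / len(original_ages)
var = sum((a - mean) ** 2 for a in original_ages) / len(original_ages)
std = var ** 0.5

# Step 2: sample entirely new, artificial records from the fitted model.
# No synthetic value corresponds one-to-one to a real individual.
rng = random.Random(42)
synthetic_ages = [round(rng.gauss(mean, std)) for _ in range(8)]
print(synthetic_ages)
```

The synthetic column follows the same distribution as the original, so aggregate analyses give similar answers, yet no row traces back to a real person.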

This case study demonstrates highlights from our quality report containing various statistics from synthetic data generated through our Syntho Engine in comparison to the original data.

Consequently, synthetic data is the preferred solution to overcome the suboptimal trade-off between data utility and privacy protection that all classic anonymization techniques offer you.

So, why still use personal data if you can use synthetic data?

From a privacy and utility perspective, you should always opt for synthetic data when your use case allows it.

                                                   | Value for analysis | Privacy risk
Synthetic data                                     | High               | None
Real (personal) data                               | High               | High
Manipulated data (through classic ‘anonymization’) | Low-Medium         | Medium-High

Synthetic data by Syntho fills the gap where classic anonymization techniques fall short, maximizing both data utility and privacy protection.