By classic anonymization, we mean all methods that manipulate or distort an original dataset to make it harder to trace data back to individuals.
Typical examples of classic anonymization that we see in practice are generalization, suppression / wiping, pseudonymization, and row and column shuffling.
The table below lists these techniques with corresponding examples.
Technique | Original data | Manipulated data |
Generalization | 27 years old | Between 25 and 30 years old |
Suppression / Wiping | info@syntho.ai | xxxx@xxxxxx.xx |
Pseudonymization | Amsterdam | hVFD6td3jdHHj78ghdgrewui6 |
Row and column shuffling | Aligned | Shuffled |
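To make the four techniques in the table concrete, here is a minimal Python sketch of each. The function names, the 5-year age bucket, the salt value and the use of a truncated SHA-256 digest as the pseudonym are all assumptions made for illustration; they are not taken from any Syntho product.

```python
import hashlib
import random


def generalize_age(age, width=5):
    """Generalization: replace an exact age with a range of `width` years."""
    low = (age // width) * width
    return f"Between {low} and {low + width} years old"


def suppress(value):
    """Suppression / wiping: mask every letter and digit, keep punctuation."""
    return "".join("x" if ch.isalnum() else ch for ch in value)


def pseudonymize(value, salt="hypothetical-salt"):
    """Pseudonymization: replace a value with a consistent stand-in token.

    Assumption: a salted, truncated SHA-256 digest serves as the token.
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:24]


def shuffle_column(rows, column, seed=0):
    """Shuffling: permute one column's values across rows."""
    values = [row[column] for row in rows]
    random.Random(seed).shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


print(generalize_age(27))          # Between 25 and 30 years old
print(suppress("info@syntho.ai"))  # xxxx@xxxxxx.xx
print(pseudonymize("Amsterdam"))   # a 24-character hexadecimal token
```

Note that each function still takes the original record as its starting point, which is the common trait of all classic anonymization techniques.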
Manipulating a dataset with classic anonymization techniques results in 2 key disadvantages: reduced data utility and incomplete privacy protection.
The following illustration demonstrates both disadvantages, using suppression and generalization.
Note: we use images for illustrative purposes. The same principle holds for structured datasets.
This introduces the trade-off between data utility and privacy protection, where classic anonymization techniques always offer a suboptimal combination of both.
No. This is a big misconception and does not result in anonymous data. Do you still apply this as a way to anonymize your dataset? Then this blog is a must-read for you.
Syntho develops software to generate an entirely new dataset of fresh data records. Because these records are artificial and generated by software, a synthetic dataset contains no personal data and no information that identifies real individuals, so no privacy risks remain.
The key difference at Syntho: we apply machine learning. As a result, our solution reproduces the structure and properties of the original dataset in the synthetic dataset, which maximizes data utility. You can therefore obtain the same results when analyzing the synthetic data as when analyzing the original data.
This case study presents highlights from our quality report, which compares various statistics of synthetic data generated by our Syntho Engine with those of the original data.
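As a rough illustration of what comparing statistics between an original and a synthetic dataset can look like, the sketch below computes the gap in mean and standard deviation for one numeric column. It is not the Syntho quality report: the function name `compare_column` is hypothetical and the age values are made-up toy data.

```python
from statistics import mean, stdev


def compare_column(original, synthetic):
    """Return the means of both samples and the gaps in mean and stdev."""
    return {
        "mean_original": mean(original),
        "mean_synthetic": mean(synthetic),
        "mean_gap": abs(mean(original) - mean(synthetic)),
        "stdev_gap": abs(stdev(original) - stdev(synthetic)),
    }


# Made-up toy values, for illustration only (not real or Syntho data).
original_ages = [23, 35, 41, 29, 52, 38]
synthetic_ages = [25, 33, 43, 28, 50, 40]

report = compare_column(original_ages, synthetic_ages)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

Small gaps suggest that the synthetic column preserves these two properties of the original; a fuller comparison would also look at distributions and at relationships between columns.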
In conclusion, synthetic data is the preferred solution to overcome the typical suboptimal trade-off between data utility and privacy protection that all classic anonymization techniques offer.
From a data utility and privacy protection perspective, one should therefore always opt for synthetic data when the use case allows it.
Data type | Value for analysis | Privacy risk |
Synthetic data | High | None |
Real (personal) data | High | High |
Manipulated data (through classic ‘anonymization’) | Low-Medium | Medium-High |
Synthetic data by Syntho fills the gaps where classic anonymization techniques fall short by maximizing both data-utility and privacy-protection.