Synthetic data use case: synthetic test data
Set up a future-proof test infrastructure with synthetic test data as an alternative to production data, scrambled data, dummy data or masked data.
The old situation
Production data, scrambled data, dummy data or masked data as test data
Since the introduction of the GDPR, companies are obliged to define why they need to use personal data. If they have no legitimate reason to do so, the GDPR requires them to obtain explicit permission. On top of this, the ‘data minimization’ principle requires companies to limit their use of personal data to what is strictly necessary.
Typically, software testing is not the purpose for which personal data was originally collected. Combined with the data minimization principle, this means that using privacy-sensitive production data as test data is not allowed.
Why is this particularly applicable to software test infrastructure?
Software test environments are typically used for software development purposes. Organizations frequently work with third-party software developers, who are often located in different countries or continents and may be externally hired (not on the company payroll).
Consequently, having production data in your test environment means that your data leaves your safe environment, and potentially ends up in different countries or continents.
What test data do companies typically use?
Frequently, we see companies having one of the following suboptimal and questionable solutions in place for their test environment(s):
Production data as test data:
Using production data for test purposes is simple to implement and, of course, results in representative data. However, it offers no privacy protection and, as mentioned before, it is simply not allowed.
Scrambled data as test data:
What if we scramble our data? This classic anonymization technique is relatively simple to apply. However, it results in unrepresentative test data, mainly because it mixes up the business rules that you actually want to test. Furthermore, unique values and outliers might still be de-anonymized, so a privacy risk remains.
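To make the point about business rules concrete, here is a minimal Python sketch. It assumes that "scrambling" means shuffling one column independently across rows, which is only one of several scrambling variants. The toy records and the rule "currency must match country" are hypothetical and stand in for any relationship between columns that a tester wants to exercise.

```python
import random

# Hypothetical business rule: every country has exactly one valid currency.
CURRENCY_BY_COUNTRY = {"NL": "EUR", "US": "USD", "GB": "GBP"}


def make_records(n_per_country=10):
    """Build toy records that all satisfy the rule."""
    return [
        {"country": country, "currency": currency}
        for country, currency in CURRENCY_BY_COUNTRY.items()
        for _ in range(n_per_country)
    ]


def scramble_column(records, column, seed=0):
    """Return a copy of the records with one column shuffled across rows."""
    rng = random.Random(seed)
    values = [r[column] for r in records]
    rng.shuffle(values)
    return [{**r, column: v} for r, v in zip(records, values)]


def count_rule_violations(records):
    """Count records whose currency does not match their country."""
    return sum(
        1 for r in records
        if CURRENCY_BY_COUNTRY[r["country"]] != r["currency"]
    )


original = make_records()
scrambled = scramble_column(original, "currency")

print("violations before scrambling:", count_rule_violations(original))
print("violations after scrambling: ", count_rule_violations(scrambled))
```

The shuffled column still contains exactly the same values, so each column looks plausible on its own. Only the relationship between the columns is lost, and that relationship is what most tests depend on.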
Dummy data as test data:
This solution offers high privacy protection, because random dummy data simply does not contain any real data. However, its data quality is low in comparison to the original data. In practice, testers or developers are typically asked to generate dummy data. This not only costs (expensive) time, it is also not the most inspiring task for your testers or developers, and it lowers their energy and motivation.
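| |Data quality|Privacy protection|Required effort|
|---|---|---|---|
|Production data|Representative|None|Low (simple to implement)|
|Scrambled data|Unrepresentative|Privacy risk remains|Low (relatively simple to apply)|
|Dummy data|Low|High|High (manual work by testers or developers)|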
The test data dilemma
This introduces a dilemma, because test environments typically require representative data so that your testers, developers and product owners can assess your software or application in representative scenarios. Without representative data, you cannot simulate representative test scenarios.
How do you arrange a test infrastructure with representative data, without having production data in it?
Synthetic test data: use AI generated synthetic data as test data
Synthetic data by Syntho reproduces the statistical characteristics of your original dataset, while ensuring that no records from the original dataset are present and that specific individuals cannot be traced back. Hence, you can set up a test environment and an acceptance environment that have the same statistical characteristics as the original production environment but do not contain records from it. Consequently, using synthetic data for your test environment and development environment has 3 benefits, as illustrated in figure 2:
- Synthetic data approximates the statistical properties of the original data, so interactions and patterns are preserved. Consequently, synthetic data is realistic and representative.
- Synthetic data does not contain records from the original dataset. Hence, synthetic data rules out privacy risk.
- Original sensitive or poorly (classically) anonymized data does not leave the building, so the likelihood of data breaches is minimized.
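One way to sanity-check the second point on your own data is to look for synthetic records that are identical to an original record. The Python sketch below is an illustration only: the function names and the toy records are hypothetical, it is not part of the Syntho product, and it assumes flat records with hashable values. Passing this check is necessary but not sufficient, since near-copies of rare records can also be a privacy concern.

```python
def record_key(record):
    """Order-independent, hashable fingerprint of a flat record."""
    return tuple(sorted(record.items()))


def exact_copies(original, synthetic):
    """Return the synthetic records that are identical to an original record."""
    original_keys = {record_key(r) for r in original}
    return [r for r in synthetic if record_key(r) in original_keys]


# Hypothetical toy data, for illustration only.
original = [
    {"age": 34, "city": "Utrecht", "income": 41000},
    {"age": 58, "city": "Delft", "income": 67000},
]
synthetic = [
    {"age": 36, "city": "Utrecht", "income": 43500},
    # Deliberately planted exact copy, with the keys in a different order.
    {"city": "Delft", "income": 67000, "age": 58},
]

leaked = exact_copies(original, synthetic)
print(f"{len(leaked)} of {len(synthetic)} synthetic records are exact copies")
```

In a test pipeline, a check like this could run after every generation step and fail the build when the list is not empty.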
The result: representative test infrastructure with no privacy risk.
Synthetic test data generation for edge case scenario testing
Sometimes, you want to test certain edge cases specifically. However, edge cases typically do not happen often in practice and hence do not appear frequently in your datasets.
We can solve this by tweaking the dataset during the synthesis process. How does it work? Take an example where the original dataset contains only 33% females compared to 66% males. Syntho supports smart data generation to generate extra synthetic female or male data records. In this example, we can generate extra synthetic records for a certain subset to balance the dataset according to your own test preferences (here, back to 50% females and 50% males).
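The arithmetic behind this example can be sketched in Python. The sketch assumes a dataset of 99 records (33 female, 66 male) and a target share below 100%. All names are hypothetical, and the generator is a stand-in that copies existing records; it does not represent how the Syntho engine creates new synthetic records.

```python
import random
from collections import Counter


def extra_records_needed(n_group, n_other, target_share):
    """Records to add to a group so it makes up `target_share` of the total."""
    # Solve (n_group + x) / (n_group + n_other + x) = target_share for x.
    x = (target_share * (n_group + n_other) - n_group) / (1 - target_share)
    return max(0, round(x))


def rebalance(records, key, group, target_share, seed=0):
    """Append extra records for `group` until it reaches `target_share`.

    Stand-in generator: copies randomly chosen existing records of the group.
    A real synthetic data generator would create new records instead.
    """
    pool = [r for r in records if r[key] == group]
    n_extra = extra_records_needed(len(pool), len(records) - len(pool), target_share)
    rng = random.Random(seed)
    return records + [dict(rng.choice(pool)) for _ in range(n_extra)]


# Hypothetical dataset matching the example: 33 female and 66 male records.
records = [{"id": i, "gender": "F" if i < 33 else "M"} for i in range(99)]
balanced = rebalance(records, key="gender", group="F", target_share=0.5)

print(Counter(r["gender"] for r in records))
print(Counter(r["gender"] for r in balanced))
```

With these numbers, 33 extra female records are needed, which brings the dataset to 66 female and 66 male records. The same formula works for other targets, for example over-representing a rare edge case to 20% of the test set.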
Synthetic test data generation for non-existing data
Often, when developing (new) features, there is too little data, or the data does not exist yet, to run the test scenarios needed to assess the quality of your application. To overcome this, the Syntho engine operates as a data generator that can tailor the data quantity, calibrate the statistical properties or even create dummy data. This allows you to produce data for test scenarios that you otherwise would not be able to perform.
Why our customers use synthetic data
Synthetic data allows you to build a strong foundation to realize innovations with …
Superior data quality
Easy and fast