Synthetic data: software test and development environment

Minimize the use of production data and/or scrambled data in your test environments by embracing synthetic data.

THE OLD SITUATION

Production data and/or scrambled data in your test environment

Since the introduction of the GDPR, companies are obligated to define why they need to use personal data. If a company does not have a legitimate reason to do so, the GDPR requires it to specifically obtain permission. On top of this, the ‘data minimization’ principle states that companies must minimize their use of personal data and consequently use it only when strictly necessary.

Typically, software testing and development is not the purpose for which personal data was initially collected. Combined with the data minimization principle, this means that using privacy-sensitive production data for test and development purposes is not allowed.

Why is this particularly applicable for software test and development environments?

Software test environments are typically used to develop software. Frequently, organizations work with third-party software developers who are commonly located in different countries or continents and might even be externally hired (not on the company payroll).

Consequently, having production data in your test environment means that your data leaves your safe environment, potentially to different countries or continents.

How do companies typically arrange their software or application test environments?

Frequently, we see companies having one of the following suboptimal and questionable solutions in place for their test environment(s):

  1. Production data: Using production data for test and acceptance environments is simple to implement and results, of course, in representative data. However, it offers no privacy protection and, as mentioned before, it is simply not allowed.
  2. Scrambled data: What if we scramble our data? This classic anonymization technique is relatively simple to apply. However, it results in unrepresentative data for test and development purposes, mainly because it mixes up the business rules that one would actually like to test. Furthermore, unique values and/or outliers might be de-anonymized, so a privacy risk remains.
  3. Dummy data: This solution offers high privacy protection, because random dummy data simply does not contain any real data. However, its data quality is low in comparison to the original data. In practice, one typically asks testers or developers to generate dummy data. This not only costs (expensive) time, it is also not the most inspiring task for your testers or developers, which lowers their energy and motivation.

Software test environment with synthetic data

                   Data quality   Privacy protection   Required effort
  Production data  High           Low                  Low
  Scrambled data   Low            Medium               Low
  Dummy data       Low            High                 High
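
To make the scrambled-data limitation concrete, here is a minimal Python sketch. It assumes that "scrambling" means shuffling each column independently, which is only one common variant, and it uses a made-up table with the hypothetical columns `order_day` and `ship_day` and the business rule "an order cannot ship before it is placed". None of these names come from a real system.

```python
import random

def scramble_columns(rows, rng):
    """Shuffle every column independently (one simple form of scrambling)."""
    columns = {key: [row[key] for row in rows] for key in rows[0]}
    for values in columns.values():
        rng.shuffle(values)
    # Re-assemble rows from the independently shuffled columns.
    return [dict(zip(columns, combo)) for combo in zip(*columns.values())]

def count_rule_violations(rows):
    """Business rule (assumed for illustration): ship_day >= order_day."""
    return sum(1 for row in rows if row["ship_day"] < row["order_day"])

rng = random.Random(42)
# Stand-in for production data: every order ships two days after it is placed.
production = [{"order_day": 10 * i, "ship_day": 10 * i + 2} for i in range(200)]
scrambled = scramble_columns(production, rng)

print("violations in production data:", count_rule_violations(production))
print("violations in scrambled data: ", count_rule_violations(scrambled))
```

Each column still contains exactly the same values after scrambling, but the link between the columns is gone, so many rows now break the rule that a tester would want to rely on.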

The test environment dilemma

This introduces a dilemma, because test environments typically require representative data so that your testers, developers and product owners can assess your software or application in representative scenarios. Without representative data, you cannot simulate representative test scenarios.

How to arrange a test and acceptance environment with representative data, without having production data in it?

OUR SOLUTION

Embrace synthetic data in your test environment

Synthetic data by Syntho reproduces the statistical characteristics of your original dataset, while ensuring that no records from the original dataset are present and that specific individuals cannot be traced back. Hence, one can set up a test environment and an acceptance environment that have the same statistical characteristics as the original production environment but do not contain any records from it. Consequently, using synthetic data for your test and development environments has three benefits, as illustrated in figure 2:

  1. Synthetic data approaches the statistical properties of the original data, so interactions and patterns are preserved. Consequently, synthetic data is realistic and representative.
  2. Synthetic data does not contain records from the original dataset. Hence, synthetic data rules out privacy risk.
  3. Original sensitive or poorly (classically) anonymized data does not leave the building, so the likelihood of data breaches is minimized.

The result: a representative test environment and a representative acceptance environment with no privacy risk.
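
The two properties named above (similar statistics, no copied records) can be illustrated with a small Python sketch. This is not Syntho's method: it swaps in a deliberately simple stand-in model, a bivariate normal distribution fitted to two made-up numeric columns (age and income), and all function names are hypothetical.

```python
import math
import random
import statistics

def correlation(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def fit(rows):
    """Summarize (age, income) rows by means, standard deviations and correlation."""
    ages, incomes = zip(*rows)
    return {
        "mean": (statistics.fmean(ages), statistics.fmean(incomes)),
        "std": (statistics.pstdev(ages), statistics.pstdev(incomes)),
        "rho": correlation(ages, incomes),
    }

def sample(model, n, rng):
    """Draw n new rows from a bivariate normal with the fitted parameters."""
    (mean_a, mean_i), (std_a, std_i) = model["mean"], model["std"]
    rho = model["rho"]
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        age = mean_a + std_a * z1
        income = mean_i + std_i * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        rows.append((age, income))
    return rows

rng = random.Random(7)
# Stand-in for production data: income loosely depends on age.
original = []
for _ in range(2000):
    age = rng.gauss(40, 10)
    original.append((age, 1000 * age + rng.gauss(0, 5000)))

synthetic = sample(fit(original), 2000, rng)

original_rho = correlation(*zip(*original))
synthetic_rho = correlation(*zip(*synthetic))
copied_records = set(original) & set(synthetic)
print(f"correlation: original {original_rho:.2f}, synthetic {synthetic_rho:.2f}")
print(f"records copied from the original: {len(copied_records)}")
```

The synthetic rows are drawn from the fitted model rather than taken from the original table, so the relationship between age and income carries over while no original row is reused. Real datasets with categorical columns, skewed distributions or many interacting fields need a far richer model than this two-column example.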

Figure 2: Software test environment with synthetic data

Synthetic data generation for edge-case scenario testing

Data balancing with Synthetic Data

Synthetic data generation for non-existing data

Often when developing (new) features, there is too little data, or no data at all yet, to perform the desired test scenarios and assess the quality of your application. To overcome this, the Syntho engine operates as a data generator that can tailor the data quantity, calibrate the statistical properties or even create dummy data. This allows you to produce data for test scenarios that you otherwise would not be able to perform.
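
As a rough illustration of generating data for a feature that has no data yet, the Python sketch below draws rows from a hand-written specification of categories and target proportions. It does not use or represent the Syntho engine; the spec format, the column names (`loyalty_tier`, `newsletter_opt_in`) and the function `generate_rows` are all hypothetical.

```python
import random
from collections import Counter

def generate_rows(spec, n, rng):
    """Create n rows; each column is drawn from its declared values and weights."""
    rows = []
    for _ in range(n):
        row = {}
        for column, weighted_values in spec.items():
            values = list(weighted_values)            # the allowed values
            weights = list(weighted_values.values())  # their target proportions
            row[column] = rng.choices(values, weights=weights, k=1)[0]
        rows.append(row)
    return rows

# Hypothetical spec for a new feature without any production data.
spec = {
    "loyalty_tier": {"none": 0.6, "silver": 0.3, "gold": 0.1},
    "newsletter_opt_in": {True: 0.5, False: 0.5},
}

rng = random.Random(1)
rows = generate_rows(spec, 10_000, rng)  # the quantity is a free parameter

tier_counts = Counter(row["loyalty_tier"] for row in rows)
gold_share = tier_counts["gold"] / len(rows)
print(tier_counts)
print(f"share of gold customers: {gold_share:.3f}")
```

Changing `n` tailors the data quantity, and changing the weights calibrates the proportions, for example to make a rare category frequent enough to cover it in a test scenario.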

Why our customers use synthetic data

Build a strong foundation to realize data-driven innovation with ...

  1. LESS RISK
  2. MORE DATA
  3. FASTER DATA ACCESS

Boost the realization of data-driven innovation now!