Software test and development environments

Establish a representative and GDPR-compliant test and acceptance environment by embracing synthetic data.

Does GDPR impact how I establish my test & acceptance environment?

Let me ask you this question: does your production data contain personal data?

  • If your answer is yes, then GDPR impacts how you should establish your test & acceptance environment.

GDPR, can you tell me a bit more?

Definitely. Since the introduction of GDPR, companies are obligated to define why using personal data is required. If they do not have a legitimate reason to do so, GDPR requires them to specifically obtain permission. On top of this, the ‘data minimization’ principle states that companies must minimize the use of personal data and consequently only use it when strictly necessary. Application testing and development is typically not the purpose for which the personal data was initially collected. Combined with the data minimization principle, this results in a situation where using privacy-sensitive production data for test and development purposes is not allowed.

Why is this particularly applicable for test & acceptance environments?

Test and acceptance environments are typically used for software development. Frequently, organizations work with third-party software developers that are commonly located in different countries or continents. Consequently, having production data in your test & acceptance environment results in a situation where your data leaves your safe environment, potentially to different countries or continents.

The test and acceptance environment dilemma

And this introduces a dilemma, because test and development environments typically require representative data to allow your testers, developers and product owners to assess your software or application in representative scenarios. Without representative data, you cannot simulate representative test scenarios. So how do you arrange a test and acceptance environment with representative data, without having production data in it?

How do companies typically arrange their test & acceptance environments?

Many companies have suboptimal solutions in place for this dilemma. Frequently, we see one of the following three situations in their test and acceptance environments:

 

  1. Production data: Using production data for test and acceptance purposes is simple to implement and results in, of course, representative data. However, it offers no privacy protection and, as mentioned before, it is simply not allowed.
  2. Scrambled data: What if we scramble our data? This classic anonymization technique is relatively simple to apply. However, it results in unrepresentative data for test and development purposes, mainly because it mixes up the business rules that one would actually like to test. Furthermore, unique values and/or outliers might be de-anonymized, so a privacy risk remains. Of course, one could delete the outliers, but typically one would like to assess how the software handles outliers and edge cases before releasing.
  3. Dummy data: This solution offers high privacy protection, because random dummy data simply does not contain any real data. However, its data quality is low in comparison to the original data. In practice, one typically asks testers or developers to generate dummy data. This not only costs (expensive) time, it is also not the most inspiring task for your testers or developers, lowering their energy and motivation.

                  Data quality   Privacy protection   Required effort
Production data   High           Low                  Low
Scrambled data    Low            Medium               Low
Dummy data        Low            High                 High
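
To illustrate why scrambled data scores low on data quality, here is a minimal Python sketch. It assumes that "scrambling" means shuffling each column independently, which is only one of several possible scrambling techniques. The order table and the rule total = quantity * unit_price are hypothetical and chosen purely for illustration.

```python
import random


def scramble_columns(rows, seed=0):
    """Shuffle every column independently (one simple form of scrambling)."""
    rng = random.Random(seed)
    # Turn the list of rows into one list of values per column.
    columns = {key: [row[key] for row in rows] for key in rows[0]}
    for values in columns.values():
        rng.shuffle(values)
    # Re-assemble rows from the independently shuffled columns.
    return [dict(zip(columns, combo)) for combo in zip(*columns.values())]


def count_rule_violations(rows):
    """Count rows that break the rule: total == quantity * unit_price."""
    return sum(1 for r in rows if r["total"] != r["quantity"] * r["unit_price"])


# Hypothetical order table in which every row satisfies the business rule.
orders = [
    {"quantity": i % 7 + 1, "unit_price": 10 + i, "total": (i % 7 + 1) * (10 + i)}
    for i in range(50)
]
scrambled = scramble_columns(orders)

print(count_rule_violations(orders))     # 0 by construction
print(count_rule_violations(scrambled))  # rows that now break the rule
```

Each column still contains exactly the same values after scrambling, but the values no longer belong together within a row, so software that validates the rule can no longer be tested meaningfully on this data.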

 

 

Would Synthetic Data offer a solution?

Synthetic data by Syntho reproduces the statistical characteristics of your original dataset, while ensuring that no records from the original dataset are present and that specific individuals cannot be traced back. Hence, one can set up a test environment and an acceptance environment that have the same statistical characteristics as the original production environment but do not contain records from it. Consequently, using synthetic data for your test environment and development environment has three benefits, as illustrated in figure 2:

  1. Synthetic data approximates the statistical properties of the original data, so interactions and patterns are preserved. Consequently, synthetic data is realistic and representative.
  2. Synthetic data does not contain records from the original dataset. Hence, synthetic data rules out privacy risk.
  3. Original sensitive or poorly (classically) anonymized data does not leave the building, so the likelihood of data breaches is minimized.

The result: a representative test environment and a representative acceptance environment with no privacy risk.
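
The idea behind the first two benefits can be sketched with a toy example in Python. This is not the Syntho engine: as a deliberately simple stand-in, it fits a two-variable Gaussian model (means, standard deviations and correlation) to a hypothetical table of ages and incomes and then samples new rows from that model. All names and numbers are assumptions made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def fit_gaussian(xs, ys):
    """Estimate means, standard deviations and correlation of two columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return mx, my, sx, sy, pearson(xs, ys)


def sample_synthetic(model, n, rng):
    """Draw n new (x, y) rows from the fitted two-variable Gaussian."""
    mx, my, sx, sy, rho = model
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mixing z1 and z2 gives y the fitted correlation with x.
        y = my + sy * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        rows.append((x, y))
    return rows


rng = random.Random(42)
# Hypothetical "production" table: income rises with age, plus noise.
ages = [rng.gauss(40, 10) for _ in range(2000)]
incomes = [1000 * age + rng.gauss(0, 5000) for age in ages]
original = list(zip(ages, incomes))

synthetic = sample_synthetic(fit_gaussian(ages, incomes), 2000, rng)
syn_ages, syn_incomes = zip(*synthetic)

print(round(pearson(ages, incomes), 2), round(pearson(syn_ages, syn_incomes), 2))
print(len(set(original) & set(synthetic)))  # rows copied from the original
```

The synthetic rows show roughly the same age-income correlation as the original table, while every row is newly sampled instead of copied. A Gaussian model captures only linear relationships between numeric columns, so this sketch illustrates the principle and nothing more.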

Figure 2: synthetic data for your test environment and acceptance environment

Synthetic data generation for non-existing data

Often when developing (new) features, the data quantity is insufficient, or the data needed to perform the desired test scenarios and assess the quality of your application is not present yet or does not exist at all. To overcome this, the Syntho engine operates as a data generator to tailor the data quantity, calibrate the statistical properties or even create dummy data. This allows you to produce data for test scenarios that you otherwise would not be able to perform.

Figure 3: data synthesis and data generation
