Assume that your organization has a software solution that you want to improve. To do so effectively, you need to test and develop in order to deliver new features. Typically, product owners, software developers and software testers rely on a test infrastructure to develop and test the software solution, making sure that new features fit the existing business logic and that the solution runs smoothly and bug-free. Here, we discuss test data in particular as the foundation for testing, developing and delivering state-of-the-art software solutions.
Naturally, using more accurate test data results in more accurate tests. Consequently, the best test data to use is probably your actual production data. Simple, right?
Unfortunately, it is not. Due to (privacy) regulations, organizations are not allowed to use actual production data as test data. Consequently, organizations have to build an effective test infrastructure with alternative solutions.
Typically, we see companies relying on one of the following suboptimal and questionable test data ‘solutions’:
With anonymization or masking, one manipulates the original data to make it harder to trace back individuals. For example, one deletes parts of the data, generalizes values or shuffles rows and columns. If you do this to a small extent, it may still be possible to reconstruct parts of the original data, which introduces privacy risks. If you do it more thoroughly, you destroy the data and lose valuable data quality and business logic. Because this trade-off always exists, anonymization or masking is not a great way to go: it results in a suboptimal combination of incurred risk and data quality.
Moreover, we see that this is often a cost- and time-intensive process, because these techniques work differently for each dataset and each data type. This triggers internal discussions about how to apply the techniques and what level of risk mitigation must be achieved. And since data formats change frequently, you have to start over again and again.
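As a rough sketch of what masking and generalization typically look like in practice (the table, column names and bucket sizes below are made up for illustration and do not reflect any specific tool):

```python
import pandas as pd

# Hypothetical customer table used only for illustration.
df = pd.DataFrame({
    "name":     ["Alice", "Bob", "Carol"],
    "age":      [34, 61, 47],
    "zip_code": ["1012AB", "3511CX", "9711KL"],
})

# Typical masking steps: drop direct identifiers, then generalize
# quasi-identifiers into coarser buckets.
masked = df.drop(columns=["name"])
masked["age"]      = pd.cut(masked["age"], bins=[0, 40, 60, 120], labels=["<40", "40-60", "60+"])
masked["zip_code"] = masked["zip_code"].str[:2] + "**"

print(masked)
# The coarser the buckets, the lower the re-identification risk,
# but also the lower the analytical and test value of the data.
```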
What if we scramble our data? This classic technique is relatively simple to apply, but it results in unrepresentative data for test purposes, mainly because it mixes up exactly the business logic one actually wants to test. Furthermore, even though unique values and outliers get shuffled around, a risk of re-identification remains (e.g., if you scramble the salaries of your company, your CEO probably still has the highest salary, so you will still be able to spot them).
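A minimal sketch of why scrambling alone does not remove the re-identification risk, using a made-up salary table (the data and column names are hypothetical):

```python
import pandas as pd

# Hypothetical HR table used only for illustration.
df = pd.DataFrame({
    "employee": ["Alice", "Bob", "Carol", "Dave"],
    "salary":   [52000, 61000, 58000, 450000],   # the CEO's salary is an obvious outlier
})

# "Scrambling": shuffle the salary column independently of the other columns.
scrambled = df.copy()
scrambled["salary"] = df["salary"].sample(frac=1, random_state=42).to_numpy()

# The row-level link is broken, but the outlier value is still present,
# so anyone who knows the CEO earns the most can re-identify that value.
print(scrambled["salary"].max())  # 450000
```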
Dummy data is randomly created data, usually crafted by hand. Although it is randomly generated and therefore does not carry privacy risks, it is not representative: it does not preserve statistical validity or relevant business logic, since the data is not real. Additionally, creating dummy data is often a cost- and time-intensive process.
Although the options stated above are suboptimal, they are still widely used because better alternatives have been lacking. The table below summarizes the trade-offs.
| Test data ‘solution’ | Data quality | (Privacy) risks | Required effort |
|---|---|---|---|
| Production data | High | High | Low |
| Anonymized or masked data | Medium | Medium | High |
| Scrambled data | Low | Medium | Low |
| Dummy data | Low | Low | High |
Syntho is an expert in end-to-end synthetic data generation and implementation. We excel both at generating (1) synthetic data twins and at supporting various (2) synthetic data optimization, augmentation and simulation features. Both can be used to generate synthetic test data so that you can test, develop and deliver state-of-the-art software solutions.
When generating a Synthetic Data Twin, Syntho mimics the original data as closely as possible while preserving privacy. Syntho generates completely new data points and models them in such a way that the properties, relationships and statistical patterns of the original data are preserved. Even complex, hidden patterns, relationships and inefficiencies are captured, so the synthetic data can be used as a direct alternative to the original data.
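As a rough illustration of what ‘preserving statistical patterns’ means in practice, one could compare per-column statistics and correlations of an original and a synthetic dataset. The file names below are hypothetical and this is not Syntho’s evaluation tooling, just a generic sanity check:

```python
import pandas as pd

# Hypothetical files: an original dataset and its synthetic twin.
original  = pd.read_csv("original_customers.csv")
synthetic = pd.read_csv("synthetic_customers.csv")

# Compare per-column summary statistics of original vs. synthetic data.
print(original.describe())
print(synthetic.describe())

# Compare pairwise correlations between numeric columns; small differences
# indicate that statistical relationships have been preserved.
numeric = original.select_dtypes("number").columns
corr_gap = (original[numeric].corr() - synthetic[numeric].corr()).abs()
print("Largest correlation difference:", corr_gap.to_numpy().max())
```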
The foundation for Synthetic Data Optimization and Augmentation is a Synthetic Data Twin. From this foundation, we can optimize and augment your data using smart generative AI, based on the requirements, logic and constraints of your business. We offer various value-adding synthetic data optimization and augmentation features to take your data (whether ‘dirty’ or ‘clean’) to the next level.
Our software generates completely new data points and models them in such a way that properties, relationships and statistical patterns are preserved, guaranteeing high quality with preserved business logic. For multi-table databases and across many different software applications, we sustain referential integrity. This means that person A with properties XYZ in dataset 1 is the same person A with properties XYZ in datasets 2, 3, 4 and so on.
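A minimal sketch of what referential integrity means across tables, using made-up example tables (this illustrates the concept only, not Syntho’s implementation):

```python
import pandas as pd

# Hypothetical multi-table example: the same (synthetic) person must keep the
# same identifier and properties across related tables.
persons = pd.DataFrame({
    "person_id": [101, 102],
    "name":      ["Person A", "Person B"],
})
orders = pd.DataFrame({
    "order_id":  [1, 2, 3],
    "person_id": [101, 101, 102],   # foreign key into the persons table
    "amount":    [20.0, 35.5, 12.0],
})

# Referential integrity check: every foreign key in `orders` must resolve
# to a row in `persons`, in the synthetic data just as in the original.
assert orders["person_id"].isin(persons["person_id"]).all()
```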
Instead of building datasets by hand, our AI software can generate large datasets and complex databases fully automatically, without requiring any additional expertise. We also know that data structures, data types and data observations change over time. That is why being able to refresh the test data across your complete test infrastructure within minutes is so valuable.
Testing edge cases is crucial. However, edge cases typically do not occur often, so relevant data to test these scenarios is scarce. We solve this with our data optimization, augmentation and simulation features by generating extra edge-case-related data. This allows you to optimize your test data and supports you when testing edge cases.
How does this work? Suppose your dataset contains roughly 33% female and 66% male records. We can generate extra synthetic data for a specific subset to tweak and balance the dataset to your own test preferences. In this example, we would generate extra synthetic female records to rebalance the dataset to 50% female and 50% male.
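The arithmetic behind this example is straightforward; the counts below are hypothetical:

```python
# Simplified arithmetic behind the rebalancing example (counts are hypothetical).
n_female, n_male = 330, 660              # roughly 33% / 66% of ~1,000 records

# To reach a 50/50 split without dropping records, generate enough extra
# synthetic female records to match the number of male records.
extra_female_records = n_male - n_female
print(extra_female_records)              # 330 additional synthetic female records
```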
When you develop new features, you typically do not have data for them yet, so you cannot test, develop and deliver your software solution. Another example: you may have combinations of data points that did not occur in the past but could occur in the future. You do not have this data, yet you want to be sure that your software solution has been tested in these scenarios.
We solve this with our data optimization, augmentation and simulation features by generating smart synthetic data. This allows you to test, develop and deliver for scenarios that you would otherwise have to cover with no data at all or with manually created data.