Synthetic data client case: test data for software testing, development and delivery

Assume that your organization has a software solution that you want to improve. To do so effectively, you need to test and develop in order to deliver new features. Typically, product owners, software developers and software testers use a test infrastructure to make sure that new features fit the existing business logic and that the software solution works smoothly and is bug-free. Here, we discuss test data in particular as the foundation for testing, developing and delivering state-of-the-art software solutions.

Is production data as test data the ultimate solution?

Naturally, using more accurate test data results in more accurate tests. Consequently, the best test data to use is probably your actual production data. Simple, right?

Unfortunately, it is not. Due to (privacy) regulations, organizations are not allowed to use actual production data as test data. Consequently, organizations have to build an effective test infrastructure with alternative solutions.

The test dilemma: production data is not allowed but the alternatives are suboptimal

Typically, we see companies having one of the following suboptimal and questionable test data ‘solutions’ in place: anonymization or masking, scrambling, or dummy data.

With anonymization or masking, one manipulates the original data to make it harder to trace back individuals. For example, one deletes parts of the data, generalizes the data or shuffles rows and columns. If you do this only to a small extent, it can still be possible to reconstruct parts of the original data, which creates privacy risks. If you do this more thoroughly, you destroy your data and lose valuable data quality and business logic. Because this trade-off always exists, anonymizing or masking results in a suboptimal combination of incurred risk and data quality.

Moreover, we see that this is often a cost- and time-intensive process, because these techniques work differently for each dataset and for each data type. This leads to internal discussions about how to apply these techniques and about the level of risk mitigation that you must achieve. And since the data format changes frequently, you must start over each time.
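
The trade-off described above can be illustrated with a minimal sketch. It generalizes a few made-up records (age and postcode are hypothetical fields, not taken from any real dataset) and then reports the size of the smallest group of identical rows. A group of size 1 means someone is still unique, while heavier generalization removes that uniqueness at the cost of detail.

```python
from collections import Counter

# Hypothetical records: (age, postcode). All values are made up for illustration.
records = [
    (34, "1012AB"), (36, "1012CD"), (35, "1013XZ"),
    (52, "3511EF"), (58, "3512GH"), (71, "3521JK"),
]

def generalize(record, age_band, postcode_digits):
    """Coarsen one record: bucket the age and truncate the postcode."""
    age, postcode = record
    low = (age // age_band) * age_band
    return (f"{low}-{low + age_band - 1}", postcode[:postcode_digits])

def smallest_group(rows):
    """Size of the smallest group of identical rows."""
    return min(Counter(rows).values())

# Light generalization: 5-year age bands, 4 postcode characters.
light = [generalize(r, age_band=5, postcode_digits=4) for r in records]
# Heavy generalization: 50-year age bands, 1 postcode character.
heavy = [generalize(r, age_band=50, postcode_digits=1) for r in records]

print("light:", smallest_group(light), "distinct rows:", len(set(light)))
print("heavy:", smallest_group(heavy), "distinct rows:", len(set(heavy)))
```

With light generalization every row stays unique, so individuals remain traceable. With heavy generalization the rows collapse into two broad groups, which lowers the risk but leaves little detail to test against.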

What if we scramble our data? This classic technique is relatively simple to apply. However, it results in unrepresentative data for test purposes, mainly because it mixes up the business logic that one would actually like to test. Furthermore, although unique values and outliers may be mixed up, a risk of re-identification remains (e.g., when scrambling the salaries of your company, your CEO probably has the highest salary, so you will still be able to spot them).
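
As a rough illustration of both problems, the sketch below shuffles a single column of a made-up employee table (the roles, salaries and the helper `scramble_column` are hypothetical). The link between role and salary is no longer reliable after shuffling, yet the outlier salary itself is still present in the data.

```python
import random

# Hypothetical employee rows: (role, salary). All values are made up.
rows = [
    ("CEO", 400_000),
    ("Engineer", 90_000),
    ("Engineer", 95_000),
    ("Intern", 25_000),
]

def scramble_column(rows, column, seed=0):
    """Shuffle one column independently of the other columns."""
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [
        row[:column] + (value,) + row[column + 1:]
        for row, value in zip(rows, values)
    ]

scrambled = scramble_column(rows, column=1)

# The set of salaries is unchanged, so the extreme value survives scrambling.
print(max(salary for _, salary in scrambled))
```

Anyone who knows that the CEO earns the most can still read off that salary, while a test that relies on realistic role and salary pairs no longer has trustworthy input.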

Dummy data is randomly created data, usually made by hand. Although it is randomly generated and does not contain any privacy risks, it is not representative and does not preserve business logic. It has no statistical validity, since the data is not real. Additionally, creating dummy data is often a cost- and time-intensive process.

Although the options stated above are suboptimal, they are still widely used, because there is a lack of better alternatives.

Test data option | Data quality | (Privacy) risks | Required effort
Production data | High | High | Low
Anonymized data or masked data | Medium | Medium | High
Scrambled data | Low | Medium | Low
Dummy data | Low | Low | High

3 key test data challenges, when you cannot use your original production data as test data

Business logic and referential integrity are not preserved

Alternatives require manual work, are not scalable and cost time

You cannot optimize your data for testing purposes

Our solution: test and develop with AI-generated synthetic test data to deliver state-of-the-art software solutions

Syntho is an expert in end-to-end synthetic data generation and implementation. We excel both in (1) generating synthetic data twins and in (2) supporting various synthetic data optimization, augmentation and simulation features. Both can be used to generate synthetic test data so that you can test, develop and deliver state-of-the-art software solutions.

The difference explained: twin versus optimization, augmentation and simulation

Generating a Synthetic Data Twin

When generating a Synthetic Data Twin, Syntho mimics the original data as closely as possible while protecting privacy. Syntho generates completely new datapoints and models them in such a way that the properties, relationships and statistical patterns of the original data are preserved. Even complex, hidden patterns, relationships and inefficiencies are captured, so the synthetic data can be used as a direct alternative to the original data.

Data Optimization and Augmentation features

The foundation for Synthetic Data Optimization and Augmentation is a Synthetic Data Twin. From this foundation, we can optimize and augment your data using smart generative AI based on the requirements, logic and constraints of your business. We offer various value-adding synthetic data optimization and augmentation features to take your data (whether ‘dirty’ or ‘clean’) to the next level.

The value of AI-generated synthetic test data for delivering state-of-the-art software solutions

High quality test data with preserved business logic and referential integrity

Our software generates completely new datapoints and models them in such a way that properties, relationships and statistical patterns are preserved, which results in high-quality data with preserved business logic. Across multi-table databases and many different software applications, we maintain referential integrity. This means that person A with properties XYZ in dataset 1 is the same person A with properties XYZ in datasets 2, 3, 4, etc.
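
To make referential integrity concrete, here is a minimal sketch of how a tester might check it on any test dataset. The two tables, their column names and the helper `orphaned_rows` are hypothetical and are not part of Syntho's software. The check simply looks for child rows whose foreign key points at no parent row.

```python
# Hypothetical two-table layout: customers (parent) and orders (child).
customers = [
    {"customer_id": 1, "name": "A"},
    {"customer_id": 2, "name": "B"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 3},  # points at a customer that does not exist
]

def orphaned_rows(child_rows, parent_rows, key):
    """Return child rows whose foreign key has no matching parent row."""
    parent_keys = {row[key] for row in parent_rows}
    return [row for row in child_rows if row[key] not in parent_keys]

orphans = orphaned_rows(orders, customers, key="customer_id")
print(orphans)
```

An empty result means every order can be joined to a customer. Any row returned here would break joins and business logic in the application under test.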

Easy and fast data generation with state-of-the-art AI

Instead of building datasets by hand, we can generate large datasets and complex databases fully automatically with our AI software, without requiring any additional expertise. We know that data structures, data types and data observations change over time. That is why being able to update the test data of your complete test infrastructure within minutes is so valuable.

Optimized test data for edge case testing

Testing edge cases is crucial. However, edge cases typically do not occur often, so relevant data to test these scenarios is scarce. We can solve this with our data optimization, augmentation and simulation features, which generate extra edge-case-related data. This allows you to optimize your test data and supports you when testing edge cases.

How does this work? Suppose your dataset consists of 33% females and 67% males. We can generate extra synthetic data for a certain subset to tweak and balance the dataset to your own test preferences. In this example, we would generate extra synthetic female data records until the dataset is balanced at 50% females and 50% males.
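
The arithmetic behind such balancing can be sketched as follows. The function name and the dataset size of 1,000 records are assumptions for illustration only. The sketch computes how many extra records a subset needs to reach a target share, and says nothing about how the synthetic records themselves are generated.

```python
import math

def extra_records_needed(subset_count, total_count, target_share):
    """Records to add to a subset so that it makes up target_share of the enlarged dataset."""
    if not 0 < target_share < 1:
        raise ValueError("target_share must be between 0 and 1")
    # Solve (subset_count + x) / (total_count + x) = target_share for x.
    needed = (target_share * total_count - subset_count) / (1 - target_share)
    return max(0, math.ceil(needed))

# Hypothetical dataset: 1,000 records, of which 330 are female.
print(extra_records_needed(subset_count=330, total_count=1000, target_share=0.5))
```

For this hypothetical dataset, 340 extra female records bring the totals to 670 females and 670 males.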

Generate new test data for new features, when you do not have data yet

When you develop new features, you typically do not have data yet. Hence, you cannot test, develop and deliver your software solution. As another example, there may be combinations of datapoints that have not occurred in the past but could occur in the future. You do not have this data, but you would still like to make sure that you have tested your software solution in these scenarios.

We can solve this with our data optimization, augmentation and simulation features to generate smart synthetic data. This allows you to test, develop and deliver in scenarios that you would otherwise have to cover with no data or with manually created data.
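
A first step toward covering such scenarios is knowing which combinations are missing. The sketch below is a hypothetical helper, not Syntho's method. It lists the combinations of categorical values that are possible in principle but absent from a made-up set of observed records.

```python
from itertools import product

# Hypothetical observed combinations of (product_type, payment_method).
observed = {
    ("loan", "direct_debit"),
    ("loan", "card"),
    ("savings", "direct_debit"),
}

def unseen_combinations(observed):
    """List value combinations that are possible but absent from the observed data."""
    columns = list(zip(*observed))                  # one tuple of values per column
    domains = [sorted(set(col)) for col in columns]  # distinct values per column
    return [combo for combo in product(*domains) if combo not in observed]

print(unseen_combinations(observed))
```

Here the only gap is a savings product paid by card. A gap like this is a candidate scenario for which extra test records would need to be created.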

The result: next-level test data and test infrastructure to deliver state-of-the-art software solutions

The % increase in our clients' ability to...

...improve overall test, development and delivery quality: 87%
...release faster and shorten the time-to-market: 74%
...reduce bugs and potentially unhappy customers: 84%
...keep testers, developers and product owners happy: 88%

Test and develop with next-level test data to deliver state-of-the-art software solutions!

Contact Syntho and explore the value of synthetic data with us