Why the future of test data for software development is synthetic data


The power of a well-functioning test infrastructure

Assume that your organization has a software solution that you want to improve. To do so effectively, you will have to develop and test new features thoroughly. Typically, product owners, software developers and software testers use test data to make sure that new features fit the existing business logic and that the software solution works smoothly and is bug-free. Naturally, more realistic test data results in more reliable tests. Consequently, the best test data to use would seem to be the actual production data. Simple, right?

Unfortunately, it is not. Due to privacy regulations, organizations are generally not allowed to use actual production data that contains personal information as test data. So how can you build an effective test infrastructure without using the original production data?


Most organizations end up with inferior alternatives

Since using production data as test data is not an option, organizations have to find alternatives. Let us look at the alternatives that most organizations use:


Anonymized/scrambled data

  • With anonymization and scrambling, one manipulates the original data to make it harder to trace records back to individuals. For example, one deletes parts of the data, generalizes the data or shuffles rows and columns. If you do this only to a small extent, it can still be possible to reconstruct parts of the original data, which creates privacy risks. If you do this more thoroughly, you destroy your data and lose valuable data quality and business logic. Because this trade-off always exists, anonymizing or scrambling data results in a sub-optimal combination of incurred risk and data quality. Moreover, we see that this is often a cost- and time-intensive process that leads to internal discussions about the required level of risk mitigation.
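To make the quality side of this trade-off concrete, here is a minimal sketch in Python. It assumes a made-up toy table in which income loosely depends on age, and it "scrambles" the income column by shuffling it independently of the age column. The column names, the numbers and the `pearson` helper are illustrative assumptions, not a description of any particular anonymization tool.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical toy table: income loosely depends on age.
ages = [rng.randint(20, 65) for _ in range(1000)]
incomes = [1000 * age + rng.gauss(0, 5000) for age in ages]

# "Scramble" one column by shuffling it independently of the other.
scrambled_incomes = incomes[:]
rng.shuffle(scrambled_incomes)

original_corr = pearson(ages, incomes)
scrambled_corr = pearson(ages, scrambled_incomes)

print(f"correlation before scrambling: {original_corr:.2f}")
print(f"correlation after scrambling:  {scrambled_corr:.2f}")
```

Every income value is still present after the shuffle, so each column looks fine on its own, but the relationship between age and income is gone. A test that depends on that relationship would no longer exercise realistic cases.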

Dummy data

  • Dummy data is made-up data, usually created by hand or generated at random. Because it is not derived from real records, it does not carry privacy risks, but it is also not representative. It has no statistical validity and does not reflect the relevant business logic, since the data is not real. Additionally, creating dummy data is often a cost- and time-intensive process.

Both options stated above are suboptimal because they force organizations to choose between high risk with high data quality and low risk with low data quality. Nevertheless, they are still widely used, because better alternatives have been lacking. Without proper test data, it is challenging to test properly and to develop software well. Consequently, it has been very difficult for organizations to test and develop their software effectively with the test data available to them. Until now.


Synthetic data is here to save the day

Let me introduce you to synthetic data. When generating synthetic data, Syntho generates completely new datapoints that do not contain any privacy-sensitive elements of the original data. At the same time, Syntho uses artificial intelligence to model those new datapoints in such a way that the characteristics, relationships and statistical patterns of the original data are preserved. This is what we call a synthetic data twin. Because the statistical properties are preserved, it provides the best possible alternative to using the original production data, even for complex analysis. As a result, you have a synthetic dataset with preserved data quality and business logic that you can use for testing and developing without limits.

[Figure: Synthetic data evaluation]

Not just synthesizing, but also optimizing your test data

In addition to making a synthetic data twin, we also provide synthetic data features to optimize your data. The foundation for synthetic data optimization (and augmentation) is still a synthetic data twin. From this foundation, however, we apply generative AI to modify the generated synthetic data based on the preferences and requirements of your use case. We support various value-adding synthetic data features that allow you to optimize your data. This is the future of test data for software development: you avoid privacy risks and you can optimize data that is already of high quality. This can be a useful asset when testing and developing software.

Two examples of this are:


Edge cases casting

  • When testing software, you usually want to focus on edge cases that do not occur often. When synthesizing, we can increase or decrease the amount of data related to those edge cases, so that you can test and develop with richer data around the edge cases that matter.
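To give a feel for what a larger share of edge cases means for a test set, here is a deliberately simple Python sketch. It only resamples existing rows with replacement, which is a much cruder technique than generating new synthetic datapoints and is not how Syntho produces its data. The function `boost_edge_cases`, the order table and the "negative amount" edge case are all hypothetical names and values chosen for illustration.

```python
import random


def boost_edge_cases(rows, is_edge_case, target_share, size, seed=0):
    """Resample rows with replacement so that `target_share` of the result
    (rounded to a whole number of rows) consists of edge cases."""
    rng = random.Random(seed)
    edge = [r for r in rows if is_edge_case(r)]
    common = [r for r in rows if not is_edge_case(r)]
    if not edge or not common:
        raise ValueError("need at least one edge case and one common row")
    n_edge = round(size * target_share)
    sample = rng.choices(edge, k=n_edge) + rng.choices(common, k=size - n_edge)
    rng.shuffle(sample)  # avoid having all edge cases at the start
    return sample


# Hypothetical toy data: 1% of the orders are refunds (negative amounts).
orders = [{"order_id": i, "amount": 25.0} for i in range(990)]
orders += [{"order_id": 990 + i, "amount": -25.0} for i in range(10)]


def is_refund(order):
    return order["amount"] < 0


boosted = boost_edge_cases(orders, is_refund, target_share=0.3, size=1000)
refund_count = sum(1 for order in boosted if is_refund(order))
print(f"refunds in original data: {sum(map(is_refund, orders))} of {len(orders)}")
print(f"refunds in boosted data:  {refund_count} of {len(boosted)}")
```

Plain resampling only repeats the few edge cases you already have. The appeal of a generative approach is that it can produce new, varied datapoints around those edge cases instead of copies.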

Simulating future scenarios

  • Imagine you would like to develop a new feature, or you have a feature for which you lack sufficient data. We can simulate data that is relevant for testing and developing future scenarios. Moreover, we can simulate datapoints that do not occur in your original data yet but that might occur in the future.

These examples are a bit simplified, but they will give you a sense of the data optimization (and augmentation) services that we offer on top of creating a synthetic data twin. What test data does your organization use for testing and developing, and are you happy with it?


Do you or your organization work with test data? Perhaps you already have an idea of how synthetic data can benefit your organization, or of how we can optimize your data to reduce testing costs and shorten timelines. Maybe you are curious to learn more, or even a bit skeptical. In any case, I would love to hear your thoughts, questions and suggestions. If you have any, please reach out to me and we will have a chat about the future of test data management at your organization!

Want to learn more about synthetic data? Save your spot for the SAS Open D[N]A Café webinar!

This webinar aims to answer frequently asked questions about synthetic data generation and its implementation, not from the point of view of the generator of synthetic data (Syntho), but from the point of view of SAS, market leader in analytics.