Hi, how can we help you today?

Browse our resources, search the knowledge base

Webinar: The Future of Test Data Management

Can Syntho detect and mask PII in text and unstructured data? Does Syntho work with unstructured data in general?

Yes, Syntho has a PII text scanner that can identify and mask PII in unstructured text data. For example, it can detect and replace PII in text fields, such as doctor’s notes, by tagging and obfuscating sensitive information like names, dates, and SSNs, while creating mock replacements.
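As a rough illustration of tagging and obfuscating PII in free text, here is a minimal Python sketch. It is not Syntho's implementation: the two regular expressions and the `mask_pii` helper are assumptions made for this example, and it does not cover names, which typically require entity recognition rather than simple patterns.

```python
import re

# Hypothetical patterns for illustration only; a production scanner would
# rely on entity recognition rather than a handful of regular expressions.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def mask_pii(text: str) -> str:
    """Replace each matched PII span with a tag naming its type."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Patient seen on 2023-04-12, SSN 123-45-6789, reports mild pain."
masked = mask_pii(note)
print(masked)
```

In this sketch the tags stand in for the mock replacements mentioned above; swapping a tag for a generated value would be a separate step.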

More information can be found on this page under the “Introducing the PII text scanner” section.

As a finance company, data security is our top priority. Does Syntho support on-premise deployment, and if so, are all features available on-premise?

Yes, we facilitate on-premise deployments and all features are available on-premise.

Are synthetic data generated “in compliance” with implicit business rules? In other words, is the generator capable of inferring business rules?

Yes, Syntho’s AI-powered generation automatically captures patterns and complex relationships between columns, reproducing them in the generated synthetic data.

Additionally, Syntho offers rule-based synthetic data methods, including calculated columns, to model business rules from scratch, e.g. for cases where you don’t have any data yet.
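To make the idea of a calculated column concrete, the following Python sketch generates rows from scratch and derives one column from two others. The column names and the rule `total = quantity * unit_price` are assumptions chosen for illustration, not a Syntho configuration.

```python
import random


def generate_orders(n: int, seed: int = 0) -> list[dict]:
    """Generate n mock order rows; 'total' is a calculated column."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    rows = []
    for order_id in range(1, n + 1):
        quantity = rng.randint(1, 10)
        unit_price = rng.choice([5, 10, 25])  # whole currency units
        rows.append({
            "order_id": order_id,
            "quantity": quantity,
            "unit_price": unit_price,
            # Business rule expressed as a calculated column.
            "total": quantity * unit_price,
        })
    return rows


orders = generate_orders(5)
for row in orders:
    print(row)
```

Because the rule is applied at generation time, every row satisfies it by construction, even when no source data exists yet.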

Can we download the PII scan report in Excel or Notepad, or is it only viewable in the tool?

The report is viewable in the tool and can also be exported as text.

Can Syntho generate synthetic versions of complex relational datasets (beyond simple tree structures)?

Syntho’s Test Data Management solutions are designed to mask and de-identify sensitive data at scale, including complex relational datasets. Syntho’s consistent mapping feature preserves consistency and referential integrity in complex relational datasets, and it works across tables, across databases, across systems and even over time.
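One common way to obtain consistent mapping is a keyed, deterministic transformation: the same input always yields the same replacement, so keys still match across tables after masking. The Python sketch below uses HMAC-SHA256 for this; the technique, the `pseudonymize` helper, the key and the sample tables are assumptions for illustration and do not describe how Syntho implements the feature.

```python
import hashlib
import hmac

# Hypothetical secret; reusing the same key across tables, databases and
# runs is what keeps the mapping consistent.
SECRET_KEY = b"demo-key-change-me"


def pseudonymize(value: str, prefix: str = "CUST") -> str:
    """Map a value to a stable pseudonym derived from a keyed hash."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return f"{prefix}-{digest.hexdigest()[:12]}"


customers = [
    {"customer_id": "C001", "segment": "retail"},
    {"customer_id": "C002", "segment": "business"},
]
orders = [
    {"order_id": 1, "customer_id": "C002"},
    {"order_id": 2, "customer_id": "C001"},
    {"order_id": 3, "customer_id": "C002"},
]

# Mask the key column independently in each table.
masked_customers = [
    {**c, "customer_id": pseudonymize(c["customer_id"])} for c in customers
]
masked_orders = [
    {**o, "customer_id": pseudonymize(o["customer_id"])} for o in orders
]

# Every masked foreign key still points at a masked customer.
known_ids = {c["customer_id"] for c in masked_customers}
print(all(o["customer_id"] in known_ids for o in masked_orders))
```

Each table is masked on its own, yet the join between orders and customers still works, which is the property referential integrity requires.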

How do you check the validity of mock data?

Syntho offers over 150 mock data generators that accurately mimic real-world data characteristics. Rule-based synthetic data can also be customized to suit specific requirements.

Can PII information be detected and adapted?

Yes, Syntho can detect and adapt PII data as configured during setup and as demonstrated during the webinar.

More information about our PII scanner can be found here.

More information about our mockers to adapt PII can be found here.

Does Syntho have the capability to handle Blobs?

Syntho supports handling Blob data, both by duplicating and by excluding such columns. Details can be found in our User Documentation. We are happy to take a deeper dive into this with you, if desired.

How do you make sure that all PII like birthdate is detected?

The PII scanner detects all PII attributes and identifiers. A birthdate alone may not uniquely identify an individual, but you can customize the scanner to include attributes like birthdate and other variables as needed, so that it also detects such non-identifiers.

The scanner offers both “shallow” and “deep” scans: a shallow scan reviews metadata, such as column names and data types, while a deep scan leverages advanced entity recognition to analyze actual data in depth. This flexibility allows you to specify which PII types to detect.
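The difference between the two scan depths can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the column-name hints, the email pattern and the 0.8 threshold are invented for the example, and a regular expression stands in for the entity recognition a real deep scan would use.

```python
import re

# Hypothetical hints and pattern, chosen only for this example.
NAME_HINTS = ("ssn", "birth", "email", "phone")
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def shallow_scan(columns: dict[str, list[str]]) -> set[str]:
    """Flag columns based on metadata only (here: the column name)."""
    return {
        name for name in columns
        if any(hint in name.lower() for hint in NAME_HINTS)
    }


def deep_scan(columns: dict[str, list[str]], threshold: float = 0.8) -> set[str]:
    """Flag columns whose actual values mostly look like email addresses."""
    flagged = set()
    for name, values in columns.items():
        if not values:
            continue
        hits = sum(1 for value in values if EMAIL.match(value))
        if hits / len(values) >= threshold:
            flagged.add(name)
    return flagged


table = {
    "email": ["a@example.com", "b@example.com"],
    "contact": ["c@example.com", "d@example.com"],
    "city": ["Amsterdam", "Utrecht"],
}

print(sorted(shallow_scan(table)))
print(sorted(deep_scan(table)))
```

Here the `contact` column is missed by the shallow scan because its name gives no hint, while the deep scan flags it from the values themselves.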

Why should mock data, even if it is PII-related, be protected?

PII, or Personally Identifiable Information, refers to sensitive data linked to individuals. Privacy regulations make it challenging to use personal data for testing purposes, so it is essential to protect this data accordingly.

PII Scanner

Can I also identify PII manually?

Yes, users can also identify PII entities manually as an alternative to the PII scanner. Users can also apply mockers manually as an alternative to the automated suggested mockers. However, we optimized our platform in such a way that AI does the work for you to mitigate manual work and to be able to process large data volumes quickly.

Why do organizations use the PII column scanner?

To initiate de-identification, identifying columns containing Personally Identifiable Information (PII) is essential. However, this often demands extensive time and manual effort from developers.

Our solution streamlines this process with an automated, AI-powered PII scanner that lets customers identify and de-identify PII efficiently. It eliminates manual effort, improves efficiency and helps ensure comprehensive identification of sensitive data.

PII definition

PII stands for Personally Identifiable Information. PII is unique to each individual: no two people share the same trait. Learn more about the definition of PII here.

Webinar: Secure Data, Smarter Testing

Can Syntho be integrated into a full-stack data platform for companies?

Yes, Syntho can be integrated into a full-stack data platform. While we’d love to hear more details about your specific use case, we often support integrations through our REST API.

In addition to the UI-based platform, our REST API allows you to automate and integrate all functionalities, enabling seamless integration within existing data pipelines or test environments. If you’re interested in exploring this further, we’re happy to discuss how Syntho can fit into your architecture.

How does Syntho typically handle deployments?

Our deployment process follows a structured onboarding phase, ensuring organizations successfully integrate and use the platform. This process includes:

Requirement gathering – Understanding the specific needs of the organization
Step-by-step deployment – Ensuring seamless integration into existing systems
Syntho Bootcamp – A training program that equips teams with the knowledge to effectively use the platform

If you’re interested in a demo or a deep-dive session, feel free to reach out to us.

How does Syntho’s deployment work cost-wise in a client’s cloud environment?

Since Syntho is deployed within the customer’s infrastructure, it runs on the organization’s own hardware.

To provide clarity on resource requirements, we have detailed hardware specifications available in our user documentation.

Typical starting requirements include:

12–20 virtual CPUs
32GB of memory
128GB of disk storage

However, the exact resource requirements depend on the amount of data being processed. We will share the documentation with the relevant details to help organizations estimate their infrastructure needs.

Is this technique also possible for a hospital that works with Chipsoft?

Yes, this is definitely possible. We have spoken with multiple medical organizations that use Chipsoft software.

One of the key challenges we’ve observed with Chipsoft is that while organizations still possess the data, the relationships between tables are stored on the Chipsoft side. This presents a challenge with foreign keys, which are essential for maintaining relationships between tables.

To address this, our platform offers a solution that allows you to add foreign keys automatically to the generated test data. Instead of manually adding them one by one, you can import foreign keys using a JSON file and run a foreign key scan within the platform to detect and apply relationships.

This ensures that your test data retains the necessary structure while streamlining the process.
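For readers curious how a foreign key scan can work in principle, the Python sketch below proposes a relationship whenever all values of one column appear in a unique column of another table. This inclusion heuristic, the function name and the sample tables are assumptions for illustration; the source does not describe how Syntho's scan detects relationships or what its JSON import format looks like.

```python
def find_foreign_key_candidates(tables: dict) -> list[tuple]:
    """Return (child_table, child_column, parent_table, parent_column)
    tuples where the child values are a subset of a unique parent column."""
    # Columns with no duplicate values are potential primary keys.
    unique_columns = {}
    for table, columns in tables.items():
        for column, values in columns.items():
            if values and len(set(values)) == len(values):
                unique_columns[(table, column)] = set(values)

    candidates = []
    for table, columns in tables.items():
        for column, values in columns.items():
            if not values:
                continue
            for (parent, key), key_values in unique_columns.items():
                if parent != table and set(values) <= key_values:
                    candidates.append((table, column, parent, key))
    return candidates


tables = {
    "patients": {
        "patient_id": [1, 2, 3],
        "name": ["A", "B", "C"],
    },
    "visits": {
        "visit_id": [10, 11, 12, 13],
        "patient_id": [1, 1, 3, 2],
    },
}

print(find_foreign_key_candidates(tables))
```

A heuristic like this can produce false positives on small tables, so detected relationships are best reviewed before they are applied.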

Webinar: Accelerate CI/CD in Finance

What kinds of test data use cases is the platform typically used for?

We’ve actually put together a full deck that outlines the top 16 test data use cases we see most often with our clients. It covers not just what the use case is, but also the pain points teams are dealing with today and how we solve them.

The platform is used for all the expected things like regression testing, API testing, security testing, performance and load testing, but also for creating demo environments with production-like data. That’s especially important when showcasing a solution or feature. When it comes to training AI and machine learning models, having clean, realistic test data is critical, and we support that too.

A lot of organizations know they need to get serious about compliance and want to speed up innovation, but they don’t know where to start. That’s exactly why we created the use case deck, to help teams get going quickly, with real clarity on how to move forward.

If you’d like to get a hold of this deck, feel free to reach out to stephan@syntho.ai

What functionalities do customers use the most when managing test data?

Connecting to databases and generating the actual data are the two most-used features. We offer a lot of connectors, so it’s easy for users to plug into their systems and get going.

When it comes to the generators, usage really depends on the use case. AI-powered generation is especially useful when statistical realism matters, for example in analytics or model training scenarios. But for standard test data needs, format-specific generators (like for social security numbers) are used often. We also see a lot of teams relying on features that help maintain consistency across multiple systems, which is really important in enterprise environments.

How can test data be created quickly, and what sets your approach apart from manual methods?

One major bottleneck is identifying sensitive data across large, complex databases. To address this, we developed an AI-powered PII scanner that detects sensitive data automatically, even when it’s not obvious from column names. It also suggests how to replace this data and apply the right generators instantly, streamlining the entire process which can save a lot of time. We also provide a wide variety of generators, which means no matter the scenario, there’s always an efficient path forward.

We also see that a lot of the clients we work with have struggled with older, scripting-based solutions. In those setups, even small changes require developer involvement and take weeks to roll out. With our platform, that kind of bottleneck doesn’t really exist as you can move fast without waiting on a dev team to write or update scripts every time.

Synthetic Data

What is the difference between synthetic data (a synthetic data twin) and mock data?

Mock data and AI-generated synthetic data are both types of synthetic data, but they are generated in different ways and serve different purposes.

Mock data is a type of synthetic data that is manually created and is often used for testing and development purposes. It typically simulates the behavior of real-world data in a controlled environment, for example to test the functionality of a system or application. It is often simple, easy to generate, and does not require complex models or algorithms. Mock data is also often referred to as “dummy data” or “fake data”.

AI-generated synthetic data, on the other hand, is generated using artificial intelligence techniques, such as machine learning or generative models. It is used to create realistic and representative data that can be used in place of real-world data when using the real-world data would be impractical or unethical due to strict privacy regulations. It is often more complex and requires more computational resources than manual mock data. As a result, it is much more realistic and mimics the original data as closely as possible.

In summary, mock data is manually created and is typically used for testing and development, while AI-generated synthetic data is created using artificial intelligence techniques and is used to create representative and realistic data.

Do you support mockers and mock data?

Yes, we do. We offer various value-adding synthetic data optimization and augmentation features, including mockers, to take your data to the next level.

What do you mean by generating a ‘synthetic data twin’?

A synthetic data twin is an algorithm-generated replica of a real-world dataset and/or database. With a synthetic data twin, Syntho aims to mimic an original dataset or database as closely as possible to create a realistic representation of the original. With a synthetic data twin, we aim for superior synthetic data quality when compared against the original data. We do this with our synthetic data software, which uses state-of-the-art AI models. Those AI models generate completely new datapoints and model them in such a way that we preserve the characteristics, relationships and statistical patterns of the original data to such an extent that you can use it as if it were original data.

This can be used for a variety of purposes, such as testing and training machine learning models, simulating scenarios for research and development, and creating virtual environments for training and education. Synthetic data twins can be used to create realistic and representative data that can be used in place of real-world data when it is not available or when using the real-world data would be impractical or unethical due to strict data privacy regulations.

What are typical synthetic data use cases?

Generally, most of our clients use synthetic data for:

Software testing & development
Synthetic data for analytics, model development and advanced analytics (AI & ML)
Product demos

Data Quality

Do you preserve referential integrity over multi-table databases?

Yes, we do. Our platform is optimized for databases and, consequently, for the preservation of referential integrity between datasets in the database.

Curious to find out more about this?

Ask our experts directly.

Is the quality of AI generated synthetic data good enough for advanced analytics (e.g. AI, ML, BI)?

Yes, it is. The synthetic data even holds patterns that you did not know were present in the original data.

But don’t just take our word for it. The analytics experts of SAS (global market leader in analytics) did an (AI) assessment of our synthetic data and compared it with the original data. Curious? Watch the whole event here or watch the short version about data quality here.

How does Syntho demonstrate the quality of generated synthetic data?

Guaranteeing that synthetic data holds the same data quality as the original data can be challenging, and often depends on the specific use case and the methods used to generate the synthetic data. Some methods for generating synthetic data, such as generative models, can produce data that is highly similar to the original data. Key question: how to demonstrate this?

There are some ways to ensure the quality of synthetic data:

Data quality metrics via our data quality report: One way to ensure that synthetic data holds the same data quality as the original data is to use data quality metrics to compare the synthetic data to the original data. These metrics can be used to measure things like similarity, accuracy, and completeness of the data. Syntho software includes a data quality report with various data quality metrics.
External evaluation: since the data quality of synthetic data in comparison to original data is key, we recently did an assessment with the data experts of SAS (market leader in analytics) to demonstrate the data quality of synthetic data by Syntho in comparison to the real data. Edwin van Unen, analytics expert from SAS, evaluated generated synthetic datasets from Syntho via various analytics (AI) assessments and shared the outcomes. Watch a short recap of that video here.
Testing and evaluation by yourself: synthetic data can be tested and evaluated by comparing it to real-world data or by using it to train machine learning models and comparing their performance to models trained on real-world data. Why not test the data quality of synthetic data by yourself? Ask our experts for the possibilities of this here.
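As an example of what a similarity metric can look like when you evaluate the data yourself, the Python sketch below computes the total variation distance between the category frequencies of an original and a synthetic column. This metric is one common choice picked for illustration; the source does not state which metrics the Syntho data quality report contains, and the sample columns are made up.

```python
from collections import Counter


def total_variation_distance(original: list, synthetic: list) -> float:
    """Half the sum of absolute differences between category frequencies.

    0.0 means identical distributions, 1.0 means no overlap at all.
    """
    if not original or not synthetic:
        raise ValueError("both columns must contain at least one value")
    p, q = Counter(original), Counter(synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(
        abs(p[c] / len(original) - q[c] / len(synthetic)) for c in categories
    )


# Made-up categorical columns for illustration.
original_column = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
synthetic_column = ["A"] * 45 + ["B"] * 35 + ["C"] * 20

distance = total_variation_distance(original_column, synthetic_column)
print(round(distance, 3))
```

A value close to zero suggests the synthetic column reproduces the category mix of the original; it says nothing about relationships between columns, which need separate checks.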

It’s important to note that synthetic data can never guarantee to be 100% similar to the original data, but it can be close enough to be useful for a specific use case. This specific use case can even be advanced analytics or training machine learning models.

Privacy

What does the Dutch Data Protection Authority say about using synthetic data?

One of the use cases that is specifically highlighted by the Dutch Data Protection Authority is using synthetic data as test data.

Syntho Engine

Will the referential integrity be preserved when I have a database?

Yes. Syntho software is optimized for databases containing multiple tables.

To this end, Syntho automatically detects the data types, schemas and formats to maximize data accuracy. For multi-table databases, we support automatic table relationship inference and synthesis to preserve referential integrity.

Do I need a GPU to use Syntho?

No, we optimized our platform to minimize computational requirements (e.g. no GPU required), without compromising on the data accuracy. In addition, we support auto scaling, so that one can synthesize huge databases.

Which data types do you support?

The Syntho Engine works best on structured, tabular data (anything that contains rows and columns). Within these structures, we support the following data types:

Structured data formatted in tables (categorical, numerical, etc.)
Direct identifiers and PII
Large datasets and databases
Geographic location data (like GPS)
Time series data
Multi-table databases (with referential integrity)
Open text data

Complex data support
Next to all regular types of tabular data, the Syntho Engine supports complex data types and complex data structures.

Time series
Multi-table databases
Open text

Are specific skills required to use the Syntho Engine?

Not at all. Although it may take some effort to fully understand the advantages, workings and use cases of synthetic data, the process of synthesizing is very simple and anyone with basic computer knowledge can do it. For more information about the synthesizing process, check out this page or request a demo.

How many training records do I need to synthesize my data?

Syntho’s machine learning algorithms can better generalize the features with more entity records available, which decreases the privacy risk. A minimum column-to-row ratio of 1:500 is recommended. For example, if your source table has 6 columns, it should contain a minimum of 3000 rows.
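The recommended ratio translates into a simple check. The Python sketch below is only a worked version of the arithmetic above; the function names are made up for this example.

```python
def minimum_rows(n_columns: int, rows_per_column: int = 500) -> int:
    """Minimum recommended rows for a 1:500 column-to-row ratio."""
    if n_columns < 1:
        raise ValueError("a table needs at least one column")
    return n_columns * rows_per_column


def meets_recommendation(n_columns: int, n_rows: int) -> bool:
    """True when the table has at least the recommended number of rows."""
    return n_rows >= minimum_rows(n_columns)


print(minimum_rows(6))                # 6 columns x 500 = 3000 rows
print(meets_recommendation(6, 2500))  # below the recommended minimum
```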

How long does it take to generate synthetic data?

Naturally, the generation time depends on the size of the database. On average, a table with fewer than 1 million records is synthesized in less than 5 minutes.

How do you connect the Syntho Engine with your data?

Syntho enables you to easily connect with your databases, applications, data pipelines or file systems.

We support various integrated connectors so that you can connect with the source-environment (where the original data is stored) and the destination environment (where you want to write your synthetic data to) for an end-to-end integrated approach.

Connection features that we support:

Plug-and-play with Docker
20+ database connectors
20+ filesystem connectors

Which deployment options do you support?

The Syntho Engine is shipped in a Docker container and can be easily deployed and plugged into your environment of choice.

Possible deployment options include:

On-premise
Any (private) cloud
Any other environment
