Synthetic Data Features

Extended features of the synthetic data generation platform 

The most requested additional features

The PII Column scanner is a feature that automatically detects direct Personally Identifiable Information (PII) in a user’s database via a Shallow scan or an AI-powered Deep scan. The AI engine also suggests a mocker for each PII entity as a replacement.

Syntho’s PII scanner helps organizations identify direct personally identifiable information (PII) not only in databases, but also in open text data. Identified PII entities can be removed or replaced by an entity placeholder or a mock value.

Mock data is a substitute for real or sensitive information, used to replace direct identifiers. Advanced mockers can generate synthetic data from scratch or based on pre-defined rules, automating data generation to reduce the time and effort required.

Syntho offers out-of-the-box connectors for easy configuration of synthetic data generation jobs and connections to source and target environments. It integrates with more than 20 database connectors and more than 20 filesystem connectors for an end-to-end integrated approach.

Database subsetting is the process of creating a smaller or larger representative subset of a database with preserved referential integrity. Subsetting can help businesses to expand their data or reduce computation costs by creating smaller data subsets.

Workspace sharing enables organizations to collaborate and scale their use of synthetic data. It allows teams to work together or separately within the same workspace, with different levels of access and permissions based on their roles.

Deep dive into the highlighted additional features

PII column scanner and mockers

What is the PII Column scanner?

The PII Column scanner feature allows users to automatically discover direct Personally Identifiable Information (PII) in their database. This feature has two scan options: (1) Shallow scan (metadata only, including column names) and (2) Deep scan (metadata and the data itself).

Shallow PII scan

The Shallow scan applies regular expression rules to infer what type of PII each column may contain based on the column names.
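
As a rough illustration of this idea (not Syntho's actual rule set), a shallow scan can be sketched in Python as a small table of regular expressions applied to column names only. The patterns, the entity labels and the `shallow_scan` function are assumptions made for this example.

```python
import re

# Hypothetical rules: entity label -> pattern matched against column names.
SHALLOW_RULES = {
    "EMAIL": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "PHONE": re.compile(r"phone|mobile", re.IGNORECASE),
    "NAME": re.compile(r"(first|last|full)[-_]?name|surname", re.IGNORECASE),
    "SSN": re.compile(r"ssn|social[-_]?security", re.IGNORECASE),
}


def shallow_scan(column_names):
    """Return {column_name: entity_label} for columns whose name matches a rule."""
    findings = {}
    for name in column_names:
        for label, pattern in SHALLOW_RULES.items():
            if pattern.search(name):
                findings[name] = label
                break  # the first matching rule wins
    return findings


columns = ["customer_id", "first_name", "Email_Address", "phone_number", "order_total"]
print(shallow_scan(columns))
```

Because only the names are inspected, a column with an uninformative name such as `col_7` would go unnoticed in this sketch, which is the gap a scan of the data itself is meant to close.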

Deep PII scan

The Deep scan also scans the data under each column to discover potential PII entities. However, this scan is more time-consuming and resource-intensive than the Shallow scan.
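
The sketch below shows, under simplifying assumptions, what scanning the values of a column can look like. The Deep scan is described as AI-powered; the regular expressions here are only a stand-in for such a detector, and `VALUE_PATTERNS`, `deep_scan` and the `min_match_ratio` threshold are hypothetical names chosen for this example.

```python
import re

# Hypothetical value-level detectors; a real deep scan would rely on a trained model.
VALUE_PATTERNS = {
    "EMAIL": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "PHONE": re.compile(r"^\+?\d[\d\s()-]{6,}\d$"),
}


def deep_scan(table, min_match_ratio=0.8):
    """Flag a column when most of its non-empty values match one detector.

    `table` maps column names to lists of string values.
    """
    findings = {}
    for column, values in table.items():
        non_empty = [v for v in values if v]
        if not non_empty:
            continue
        for label, pattern in VALUE_PATTERNS.items():
            hits = sum(1 for v in non_empty if pattern.match(v))
            if hits / len(non_empty) >= min_match_ratio:
                findings[column] = label
                break
    return findings


sample = {
    "contact": ["ann@example.com", "bob@example.org", ""],
    "notes": ["called twice", "prefers email"],
    "col_7": ["+31 20 555 0100", "020-555-0199"],
}
print(deep_scan(sample))
```

Every value is inspected, so the work grows with the size of the data, which is in line with the higher cost of the Deep scan mentioned above.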

All columns that are identified as PII are shown in the list of PII entities on the PII tab and are labeled PII on the column header on the Job Settings tab.

Apply mockers on PII automatically

The AI Engine of Syntho can automatically suggest the correct mocker for each PII entity, saving time and effort for the user. By using this feature, users can ensure that the sensitive original PII is protected, does not show up in the synthetic data and is replaced by representative mock data with preserved referential integrity for multi-table databases.

Manual PII detection and mocker configuration

Users can also identify PII entities manually as an alternative to the Shallow scan and/or Deep scan, and can apply mockers manually as an alternative to the automatically suggested mockers. However, we optimized our platform so that AI does the work for you, which reduces manual work and allows large data volumes to be processed fast.

PII identification & obfuscation in open texts

PII Scanner on open text

Our PII scanner feature in the Syntho Engine helps organizations identify direct personally identifiable information (PII) within their databases, and now also in open text. PII includes information that could directly identify an individual, such as their name, address, social security number, email address, or phone number. Using a PII scanner can help organizations comply with data protection regulations such as GDPR, HIPAA or CCPA, which require measures to protect personal information, for open text as well.

Obfuscate identified PII

Identified PII can be removed or replaced by an entity placeholder, a mocker value or a default value.

Once PII is identified, the Syntho Engine offers three ways to protect the information:

  • removing the PII,
  • replacing the PII with an entity placeholder, or
  • replacing the PII with a mocker value.

Removing PII deletes the information or substitutes a default value. Replacing PII with an entity placeholder maintains the structure of the data while protecting sensitive information. Replacing PII with a mocker value substitutes fictitious data that maintains the format of the original data.
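
The three options can be sketched in Python as follows. This is a minimal illustration, not the Syntho Engine's implementation: the two regular expressions, the fixed `MOCK_VALUES` and the `obfuscate` function are assumptions, and the sketch does not detect names or addresses.

```python
import re

# Hypothetical detectors for two entity types only.
ENTITY_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d -]{7,}\d"),
}

# Hypothetical fixed mock values, one per entity type.
MOCK_VALUES = {"EMAIL": "jane.doe@example.com", "PHONE": "+1 555 0100"}


def obfuscate(text, strategy):
    """Apply one of three strategies: 'remove', 'placeholder' or 'mock'."""
    for label, pattern in ENTITY_PATTERNS.items():
        if strategy == "remove":
            replacement = ""
        elif strategy == "placeholder":
            replacement = f"<{label}>"
        elif strategy == "mock":
            replacement = MOCK_VALUES[label]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        text = pattern.sub(replacement, text)
    return text


note = "Contact Ann at ann@example.com or +31 20 555 0100."
for strategy in ("remove", "placeholder", "mock"):
    print(strategy, "->", obfuscate(note, strategy))
```

The placeholder variant keeps the sentence readable and shows which kind of entity was present, while the mock variant keeps the text looking like natural data.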

With our PII scanner feature, you have a solution in place that now also works on open text.

Advanced mockers

What is mock data?

Mock data is a substitute for real or sensitive information, used for testing or other non-production purposes. It is created using mockers and used in place of direct identifiers (PII) to protect privacy and security.

Unlike AI-generated synthetic data, mockers do not rely on advanced algorithms. AI-generated synthetic data is created using advanced algorithms and machine learning techniques, and is designed to mimic real data with a high degree of accuracy.

Mockers to preserve referential integrity

Because you do not want real identifiers in your AI-generated synthetic data, mockers are a good alternative for those identifiers. With our “seed” function, the same input always maps to the same output across an entire data ecosystem or multi-table database. This preserves the cardinality of a column and keeps data consistent across synthetic data jobs, tables, databases and systems, which preserves referential integrity.
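
One common way to obtain such a consistent mapping is a keyed hash. The following Python sketch assumes this technique for illustration only; it is not a description of how the Syntho “seed” function is implemented, and `seeded_mock`, the `prefix` parameter and the example tables are hypothetical.

```python
import hashlib
import hmac


def seeded_mock(value, seed, prefix="person"):
    """Return a mock identifier that depends only on (seed, value)."""
    digest = hmac.new(seed.encode("utf-8"), value.encode("utf-8"), hashlib.sha256).hexdigest()
    # 12 hex characters keep the output short; collisions are improbable, not impossible.
    return f"{prefix}_{digest[:12]}"


customers = [{"id": 1, "name": "Maria"}, {"id": 2, "name": "Tom"}]
orders = [
    {"order": 10, "customer_name": "Maria"},
    {"order": 11, "customer_name": "Maria"},
    {"order": 12, "customer_name": "Tom"},
]

seed = "job-seed-42"  # hypothetical seed shared by both jobs
mock_customers = [{**c, "name": seeded_mock(c["name"], seed)} for c in customers]
mock_orders = [{**o, "customer_name": seeded_mock(o["customer_name"], seed)} for o in orders]

print(mock_customers)
print(mock_orders)
```

Both tables are mocked independently, yet every order still points to the right customer because equal inputs yield equal outputs under the same seed.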

Advanced mockers for rule-based generated synthetic data

Our platform offers a wide range of advanced mockers that can generate synthetic data from scratch or based on pre-defined rules. These mockers can be used to generate large datasets of realistic data for testing or training machine learning models. By automating the process of data generation, advanced mockers can help reduce the time and effort required to create large datasets, while also ensuring that the data generated is consistent and accurate. Additionally, advanced mockers can be customized to match specific use cases or scenarios, making them a versatile tool for generating synthetic data.

New connectors for an end-to-end integrated approach

Out-of-the-box connectors

Because the Syntho Engine includes various out-of-the-box connectors, you can easily configure your synthetic data generation job and connect the Syntho Engine to the source environment and the target environment. As a result, Syntho colleagues never see your original data and do not need access to your Syntho Engine or your safe environment.

Features that we support for integration

Syntho integrates with every leading database & filesystem:

  • Plug-and-play with Docker
  • 20+ database connectors
  • 20+ filesystem connectors


  • The illustration shows only some of the connectors that we support, as examples. The full list of supported connectors contains many more.
  • Let us know if you miss a connector and we will build it for you!

Subsetting: create a smaller or larger representative subset of a database


What is Database Subsetting and why is it important?

Database subsetting is the process of creating a smaller or larger representative version of a database while preserving referential integrity. This is done by configuring software such as Syntho to include a specific percentage or selection of the data. Whether you need to expand your data for more accurate analyses or reduce computation costs by creating smaller data subsets, the Syntho Engine’s Generative AI technology makes it easy to achieve your goals. With its advanced subsetting capabilities, you can get the data you need to power your business without compromising on privacy, performance or accuracy.

Larger: create larger datasets with Syntho’s Generative AI technology

Do you need a larger dataset or database for your business needs? With the Generative AI technology in the Syntho Engine, you can easily create a larger dataset from your existing data. Whether you want to perform more accurate analyses or train a machine learning model with more data, the Syntho Engine has you covered.

Smaller: Reduce Computation Costs with Smaller Data Subsets

On the other hand, dealing with huge amounts of data can result in high computation costs, especially when testing in a production environment. That’s where Syntho Engine comes in. With its subsetting capabilities, you can easily create smaller subsets of your data to reduce the computation costs while still getting accurate test results.
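
For the smaller direction, the core idea of subsetting with preserved referential integrity can be sketched in a few lines of Python. The sketch assumes a simple parent table with an `id` key and a child table that references it through `customer_id`; these names and the `subset_with_integrity` function are hypothetical, and this is not the Syntho Engine's algorithm.

```python
import random


def subset_with_integrity(parents, children, fraction, seed=0):
    """Sample a fraction of parent rows and keep only children that reference them."""
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    keep = max(1, round(len(parents) * fraction))
    sampled_parents = rng.sample(parents, keep)
    kept_ids = {p["id"] for p in sampled_parents}
    sampled_children = [c for c in children if c["customer_id"] in kept_ids]
    return sampled_parents, sampled_children


# Ten customers with three orders each.
customers = [{"id": i} for i in range(1, 11)]
orders = [{"order_id": 100 + i, "customer_id": (i % 10) + 1} for i in range(30)]

sub_customers, sub_orders = subset_with_integrity(customers, orders, fraction=0.3)
print(len(sub_customers), "customers,", len(sub_orders), "orders")
```

Filtering the child table by the sampled parent keys is what prevents orphaned rows: every order in the subset still refers to a customer that is also in the subset.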

Workspace sharing to scale synthetic data for large organizations

What is workspace sharing?

Workspace sharing is a valuable feature for organizations that want to scale their use of synthetic data across multiple teams or individuals. It allows workspace owners and editors to delegate roles to other users within a workspace. This enables collaboration and improves workflow within a team by giving individuals specific access rights and permissions.

Why is workspace sharing relevant?

Workspace sharing is especially relevant for organizations that want to use synthetic data on a large scale. In such organizations, multiple teams or individuals may be working on different use cases and different aspects of the data generation and modeling process. With workspace sharing, these teams or individuals can work in the same workspace, either together or separately.

Available Workspace Roles:

When sharing a workspace with a user, you can select from the following roles:

  • Owner: has full control over the workspace, including editing and sharing access.
  • Editor: can edit workspace content, but cannot share access or delete the workspace.
  • Viewer: can view workspace content, but cannot edit or share access.
  • Loader: can load data and configure connectors to the workspace, but cannot edit, delete, or share access.
  • Commenter: can add comments to the workspace, but cannot edit, delete, or share access.

By assigning roles to users, workspace owners can ensure that everyone has the right level of access and permissions to carry out their work effectively.
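
The role list above can be read as a simple permission table. The Python sketch below encodes it under stated assumptions: the action names and the `can` helper are hypothetical, and granting "view" to the Loader and Commenter roles is an assumption because the list does not state it. Whether an Editor can load data or comment is not stated either, so the sketch leaves those out.

```python
# Permissions per role, following the role list above.
# Assumption: Loader and Commenter can also view workspace content.
ROLE_PERMISSIONS = {
    "owner": {"view", "edit", "share", "delete", "load", "comment"},
    "editor": {"view", "edit"},
    "viewer": {"view"},
    "loader": {"view", "load"},
    "commenter": {"view", "comment"},
}


def can(role, action):
    """Return True when `role` is allowed to perform `action`."""
    return action in ROLE_PERMISSIONS.get(role.lower(), set())


for role in ("Owner", "Editor", "Viewer", "Loader", "Commenter"):
    print(role, "may share access:", can(role, "share"))
```

Keeping the table in one place makes it easy to review who can do what before a workspace is shared with a larger group.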
