Synthetic data tools: Open source or commercial? A guide to building vs. buying

Elise Devaux · Published in Statice
8 min read · Sep 26, 2022


If you are looking into synthetic data, you might wonder whether it’s better to use an open source tool or buy a solution from a synthetic data vendor. This article draws on our team’s five years of experience supporting enterprises with structured synthetic data to provide some answers on the benefits, pitfalls, and costs of building versus buying.

The figures shared in this article are indicative only. Various factors influence these estimates, including your use case, industry, team size, and geography. This article was originally published at https://www.statice.ai and authored by Evgeniya Panova.

The use of synthetic data is growing across many industries, and you may wonder how to get started. Gartner predicts that by 2030, synthetic data will largely overshadow real data in AI models. If it’s not already the case, chances are your team will be looking into it soon enough.

To help you, we explored the aspects of synthetic data projects and highlighted elements to keep in mind when researching open source or commercial synthetic data solutions.

  • The first section provides an overview of open source and commercial synthetic data solutions.
  • Section two covers what you need to know about a synthetic data project.
  • In the complete version of our guide, we also provide cost estimates.

An overview of open source and commercial synthetic data solutions

Today, structured synthetic data generation software includes:

  • Commercial vendors’ software: platforms and frameworks that plug into your data pipeline and provide synthetic dataset generation and evaluation functionality out-of-the-box. Most offer privacy features with mechanisms to prevent the re-identification of an individual from the original data. Commercial vendors offer SaaS, professional services, support, and licensing based on monthly or annual fees. Some vendors offer free trials or plans.
  • Open source tools: code and toolkits that you can use to build your own synthetic data solution. Open source solutions are mostly free or low-cost to use, which makes them an attractive option for projects with smaller budgets, and many have communities and tutorials to help you get started.

List of open source synthetic data tools

  1. Copulas: Python library for modeling multivariate distributions and sampling from them using copula functions.
  2. CTGAN: SDV’s collection of deep learning-based synthetic data generators for single table data.
  3. DataGene: Tool to train, test, and validate datasets, detect and compare dataset similarity between real and synthetic datasets.
  4. DoppelGANger: Synthetic data generation framework based on generative adversarial networks (GANs).
  5. DP_WGAN-UCLANESL: Solution that trains a differentially private Wasserstein generative adversarial network (WGAN) on the real, private dataset.
  6. DPSyn: Algorithm for synthesizing microdata while satisfying differential privacy.
  7. Generative adversarial nets: Repository for synthetic time-series data generation using generative adversarial networks (GANs).
  8. Gretel.ai: Commercial synthetic data vendor that offers open source functionality.
  9. mirrorGen: Python tool that generates synthetic data based on user-specified causal relations among features in the data.
  10. Open SDP: Community for sharing educational analytic tools and resources.
  11. Pydbgen: Python package that generates a random database table based on the user’s choice of data types.
  12. SmartNoise synthesizer: Differentially private open source synthesizer for tabular data.
  13. Synner: Tool to generate real-looking synthetic data by visually specifying the properties of the dataset.
  14. Synth: Data-as-code tool that provides a simple CLI workflow for generating consistent data in a scalable way.
  15. Synthea: Synthetic patient generator that models the medical history of synthetic patients.
  16. Synthetic Data Vault (SDV): One of the first open source synthetic data solutions, SDV provides tools for generating synthetic tabular, relational, and time series data.
  17. TGAN: Generative adversarial training for generating synthetic tabular data.
  18. Tofu: Python library for generating synthetic UK Biobank data.
  19. Twinify: Software package for privacy-preserving generation of a synthetic twin to a given sensitive dataset.
  20. YData: Synthetic structured data generator by YData, a commercial vendor.

You can also find descriptions of all the open source solutions in our guide or in this GitHub Awesome List. For an overview of commercial synthetic data companies, head to this other Medium post.
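To give a feel for the developer experience of these tools, here is a minimal sketch of generating synthetic tabular data with SDV’s CTGAN model. The exact API depends on the SDV version you install (this follows the pre-1.0 `sdv.tabular` interface), and the file name is a hypothetical stand-in for your own dataset.

```python
# Minimal tabular synthesis sketch with SDV (pre-1.0 "sdv.tabular" API).
# Newer SDV releases use a different interface; check the docs for your version.
import pandas as pd
from sdv.tabular import CTGAN

# Load the original (real) dataset. "customers.csv" is a placeholder.
real_data = pd.read_csv("customers.csv")

# Fit a CTGAN model on the real data, then sample a synthetic dataset of
# the same size. Training time grows with rows, columns, and epochs.
model = CTGAN(epochs=300)
model.fit(real_data)
synthetic_data = model.sample(num_rows=len(real_data))

synthetic_data.to_csv("customers_synthetic.csv", index=False)
```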

Synthetic data project assessment: Criteria for building vs. buying

What aspects should you consider when deciding whether to build or buy? Your new synthetic data generator will become a part of an existing data lifecycle or a use case you are developing. The specifics of your enterprise data lifecycle or project will impact your solution choice.

Image credit: Statice.ai

Data access

Quickly accessing and sharing data with stakeholders is often the difference between a successful project and a failure. How you acquire data (externally or internally), where your original data is stored (enterprise databases, Excel files), and how often you need to generate synthetic data will all shape your synthetic data approach. Consider how you will handle potential data access issues now and in the future.

  • Choose a commercial vendor if you need speed and plug-and-play functionality for managing data access levels and roles. Big companies with data projects involving cross-departmental work and/or third parties will find it easier to use commercial vendors for their enterprise-ready capabilities and data connectors.
  • Choose an open source solution if you are developing or testing one specific use case that you do not intend to scale. Open source gives you the ability to develop fully custom data access functionality that precisely addresses your needs, without purchasing yearly or monthly commercial vendor licenses.

Data preparation

Data preparation is among the most time-consuming and important phases of many data projects. To train machine learning models with synthetic data, data scientists have to prepare the original data for synthesization.

Commercial vendors, who work directly with various types of customers and use cases, tend to offer out-of-the-box functionality for a broader range of use cases and data types. Having the necessary support and expertise in a wide range of issues is a significant advantage of commercial solutions.

While commercial vendors offer automated pre-processing features, some open source tools might require you to prepare your original data for synthesization manually (a short example of such preparation follows the list below).

  • Choose a commercial vendor if you have complex datasets that require a lot of pre-processing, custom rules, and possibly automation of pre-processing tasks.
  • Choose an open source tool if you are working with small and simple datasets, apply straightforward rules, and do not plan to scale your project.
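As an example of what that manual preparation can look like, here is a short pandas sketch: dropping direct identifiers, imputing missing values, and casting column types before synthesization. The file and column names are hypothetical.

```python
# Illustrative pre-processing before synthesization, using pandas.
# File and column names ("customers.csv", "age", "segment", ...) are hypothetical.
import pandas as pd

df = pd.read_csv("customers.csv")

# Drop direct identifiers: they should never reach the generator.
df = df.drop(columns=["customer_id", "email"])

# Impute missing numeric values and normalize categorical labels.
df["age"] = df["age"].fillna(df["age"].median())
df["segment"] = df["segment"].str.strip().str.lower().astype("category")

# Parse dates so the generator can treat them as ordered values.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
```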

Synthetic data quality and utility assessment

Data utility refers to the analytical completeness and validity of the data. Synthetic data utility requirements are closely tied to your use case: if you plan on using your synthetic data for machine learning model training or analytics, you need to evaluate its quality and utility first.

  • Choose a commercial vendor if you have complex data and need to validate the results quickly. You can run the utility evaluations of commercial vendor tools on datasets of varying complexity, and the output can be adjusted and controlled. Additionally, some commercial vendor solutions offer ways to assess the performance of machine learning models.
  • Choose an open source solution if you have simple datasets or the time and expertise to cover the wide range of situations encountered in different datasets. Open source can be a good option when you require specific utility metrics and don’t need strong utility guarantees (one do-it-yourself check is sketched below).
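The “train on synthetic, test on real” (TSTR) comparison below is one common do-it-yourself utility check you could assemble with scikit-learn: fit the same model on the real and on the synthetic data, then compare their scores on a held-out real test set. It uses generated toy data, with a bootstrap resample standing in for real synthetic output, so the script runs end to end; in practice you would plug in your own tables.

```python
# "Train on synthetic, test on real" (TSTR) utility check with scikit-learn.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-ins: in practice, synth_df would come from your generator
# (here it is just a bootstrap resample so the script runs end to end).
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
real_df = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)]).assign(target=y)
synth_df = real_df.sample(frac=1.0, replace=True, random_state=1)

real_train, real_test = train_test_split(real_df, test_size=0.3, random_state=0)

def auc_on_real_test(train_df: pd.DataFrame) -> float:
    """Fit on train_df, then score on the held-out real test set."""
    model = RandomForestClassifier(random_state=0)
    model.fit(train_df.drop(columns=["target"]), train_df["target"])
    probs = model.predict_proba(real_test.drop(columns=["target"]))[:, 1]
    return roc_auc_score(real_test["target"], probs)

# Scores close to each other suggest the synthetic data kept the signal.
print("trained on real :", auc_on_real_test(real_train))
print("trained on synth:", auc_on_real_test(synth_df))
```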

Synthetic data privacy assessment

Another crucial aspect of data access is privacy. To share synthetic data built from data that contains personal information, you need to ensure it can withstand re-identification attacks. When companies use synthetic data as an anonymization method, the biggest question is how to assess the privacy risks.

Privacy is an empirical field; without experts, it is hard to assess the risks, run privacy attacks, comply with privacy laws, and get the approval of the DPO (Data Protection Officer).

Keep in mind that building a privacy evaluation is also time-consuming. Depending on the complexity of the use case, it can take between three and six months to research, develop, test, and approve synthetic data privacy with or for a DPO.

  • Choose a commercial vendor if you don’t have the time, resources, or specialized knowledge to develop complex privacy evaluations in-house. A commercial vendor is a better option when your synthetic datasets are generated from sensitive original data (for example, customer data containing personally identifiable information) and need to be shared across multiple departments or with external stakeholders, because privacy evaluators are already built in and tested.
  • Choose an open source tool if simple privacy measurements are enough; some open source solutions offer them (and a basic check is sketched below). However, those metrics might not be robust enough to provide comprehensive, legally meaningful evaluations of the privacy risks that compliance professionals can understand. Open source will be a good fit if you don’t need to involve DPOs or demonstrate compliance in your project. If you build your synthetic data solution on open source tools and still need strong privacy guarantees, we recommend involving data privacy experts to develop and verify the privacy evaluations you need.
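As an illustration of what a simple privacy measurement can look like, the sketch below computes the distance from each synthetic record to its closest real record (often called DCR, for distance to closest record). Synthetic rows that land unusually close to real rows may be near-copies of individuals. This is deliberately simplistic, uses toy data, and is not, on its own, a legally meaningful privacy evaluation.

```python
# Distance-to-closest-record (DCR): a simple, illustrative privacy check.
# Low distances can flag synthetic rows that nearly copy real individuals.
# This is NOT a substitute for a proper privacy evaluation or DPO review.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def dcr(real: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    """Euclidean distance from each synthetic row to its nearest real row."""
    scaler = StandardScaler().fit(real)  # scale so no column dominates
    nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(real))
    distances, _ = nn.kneighbors(scaler.transform(synthetic))
    return distances.ravel()

# Toy numeric data; in practice, pass your encoded real/synthetic tables.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 5))
synthetic = rng.normal(size=(500, 5))

d = dcr(real, synthetic)
print(f"min DCR: {d.min():.3f}, 5th percentile: {np.percentile(d, 5):.3f}")
```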

Ease of use

Finally, think of who needs access to a synthetic data platform in your team or company. Sometimes, it is not just data scientists but DPOs, managers or even CEOs.

  • Choose a commercial vendor when you have a specialized case, like healthcare data, or when non-technical team members need to access the tool. Commercial synthetic data companies typically offer platforms with graphical user interfaces and expert support. The benefit is that you don’t have to be a technical user to take advantage of the tool, and vendors can support custom data types and extend functionality if needed.
  • Choose an open source tool when you want full control over the functionality and independence from third-party software. Go for open source solutions that have communities around them so you can get support when needed. For technical users, open source tools are relatively easy to use, and some projects have Discord or Slack channels where users can ask questions and solve issues collectively.

Taking a hybrid approach

Sometimes you don’t have to choose between open source and commercial vendors because you can take advantage of both with a hybrid approach.

For instance, suppose you’re happy with what you’ve built in-house, but your project is scaling and you now need extensive privacy evaluations before sharing your synthetic data. In this case, you can run the privacy assessments using scientifically validated commercial vendor functionality while keeping your in-house generation pipeline.

For an overview of a synthetic data project cost, check the complete guide.


Originally published at https://www.statice.ai
Author: Evgeniya Panova
Image credit: Statice
