List of synthetic data startups and companies — 2021
Are you looking for a synthetic data company? Or simply seeking an overview of this fast-growing market? Search no more. Here is a list of companies providing structured or unstructured synthetic data products and services.
*2022 edit: for the 2022 list, check out “List of synthetic data vendors 2022”, and for an up-to-date directory, see https://syntheticdata.carrd.co/
I’ve divided the list into 1) providers of synthetic data for structured data (tabular data) and 2) providers of synthetic data for unstructured data (image & video). At the end, I’ve added comments on the ecosystem’s growth and investment trends. For more details on the structured/unstructured segmentation, check out my post “types of synthetic data and real-life examples”.
Disclaimer: I work at Statice, one of the synthetic data providers listed in this post.
1. Synthetic data providers for structured data
The companies listed below offer synthetic data that is generated from tabular data. It mimics real-life data stored in tables and can be used for behavior, predictive, or transactional analysis.
Most vendors in this category offer some form of privacy guarantee, meaning that mechanisms are in place to prevent the re-identification of individuals from the original data.
Note that the level of data protection can vary from one vendor to another.
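To make the idea of "mimicking" tabular data more concrete, here is a minimal Python sketch. It is an illustration under strong assumptions, not how any vendor in this list works: the toy table, the column names, and the helper functions `fit_marginals` and `sample_synthetic` are all made up for this example. It only reproduces the mean and spread of each column independently, ignores correlations between columns, and offers no privacy guarantee whatsoever.

```python
import random
import statistics

# Toy "real" table (hypothetical values, for illustration only).
real_rows = [
    {"age": 23, "income": 31000},
    {"age": 35, "income": 48000},
    {"age": 41, "income": 52000},
    {"age": 52, "income": 61000},
    {"age": 29, "income": 39000},
    {"age": 47, "income": 58000},
]

def fit_marginals(rows):
    """Estimate (mean, standard deviation) for each numeric column."""
    params = {}
    for col in rows[0]:
        values = [row[col] for row in rows]
        params[col] = (statistics.mean(values), statistics.stdev(values))
    return params

def sample_synthetic(params, n, seed=0):
    """Draw n synthetic rows, sampling each column from its own Gaussian."""
    rng = random.Random(seed)
    return [
        {col: rng.gauss(mu, sigma) for col, (mu, sigma) in params.items()}
        for _ in range(n)
    ]

params = fit_marginals(real_rows)
synthetic_rows = sample_synthetic(params, n=5000)

# Compare a summary statistic between the real and synthetic tables.
real_mean_age = params["age"][0]
synthetic_mean_age = statistics.mean(row["age"] for row in synthetic_rows)
print(f"real mean age: {real_mean_age:.1f}")
print(f"synthetic mean age: {synthetic_mean_age:.1f}")
```

Commercial tools go well beyond this: they typically model the relationships between columns and add explicit privacy mechanisms, which is where the differences in data protection between vendors come from.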
- Betterdata: vendor of a privacy-preserving synthetic data solution for AI, data sharing, or product development.
- Datomize: vendor of a synthetic data solution for the development, training and testing of AI/ML models, and applications.
- Diveplane: vendor of Geminai, a solution to generate synthetic ‘twin’ datasets with the same statistical properties as the original data.
- Facteus: vendor of Mimic™, a synthetic data engine to synthesize data assets that protect consumer privacy.
- Generatrix: an AI-based privacy-preserving data synthesizing platform.
- Gretel: vendor of a synthetic data generation library and APIs for developers and data practitioners.
- Hazy: vendor of a synthetic data platform for financial institutions that want to conduct data analysis.
- Instill AI: vendor of a solution for synthetic data generation leveraging Generative Adversarial Networks and differential privacy.
- Kymera Labs: vendor of Synthetic Data Fabrication Software, a solution that generates new data without relying on the ML/GAN approach.
- Mirry.ai: vendor of a synthetic data platform for generating synthetic data using GANs, available in Community, Cloud or Enterprise editions.
- Mostly AI: vendor of Mostly Generate, a synthetic data generator that provides as-good-as-real, yet fully anonymous data.
- Oscillate.ai: vendor of a synthetic data API solution that is deployed internally.
- Replica Analytics: vendor of Replica Synthesis, a software solution that ingests data and builds synthesis models to generate synthetic datasets.
- Sarus technologies: vendor of ML software to help data practitioners leverage sensitive data assets for innovation with privacy guarantees.
- Sogeti: vendor of Artificial Data Amplifier (ADA), a solution by the Sogeti Testing AI team that generates realistic data based on real data sets.
- Statice: vendor of a software solution that generates privacy-preserving synthetic data that can be used as a drop-in replacement for an original dataset.
- Syndata AB: vendor of a synthetic data generator to generate data sets that match the statistical attributes of real data but are entirely synthetic.
- Synthesized: vendor of a DataOps platform enabling data sharing and collaboration across internal groups, remote teams, and external partners.
- Syntheticus: Swiss vendor of a platform dedicated to generating synthetic data.
- Syntho: vendor of AI software for generating synthetic data.
- Tonic: vendor of a synthetic data generator to mimic production data.
- Ydata: vendor of a synthesizer that learns statistical information from real data and generates new datasets without transforming the original data.
In addition to the privacy guarantees, some vendors focus on a given industry, such as healthcare. This is the case for:
- Kerus Cloud: software from Exploristics to generate synthetic datasets for use in various analytics applications in the life sciences sector.
- MD Clone: vendor of MDClone ADAMS, a self-service data analytics environment enabling healthcare collaboration, research, and innovation.
- Octopize MD: vendor of “The Avatar solution”, a synthetic data software and service for healthcare data.
- Pionic: vendor of software to transform medical data into tradable assets without compromising patient privacy.
- Syntegra: vendor of a synthetic data engine, purpose-built for healthcare, to synthesize replicas of medical data.
- Veil.ai: vendor of an anonymization engine that can be used for data de-identification.
Other providers of structured synthetic data focus on a different use case: the production of test data. In these cases, privacy is not taken into consideration when generating synthetic data, nor is the perfect preservation of statistical properties. Still, some realism is maintained, as we are not talking about dummy data generation.
- BizDatax’s Synthetic Data Generator: a data masking solution for production data with a synthetic data generator by Ekobit.
- Test Data Manager: a test data management solution from Broadcom with synthetic data capabilities for production data.
- Chatterbox’s Synthetic Data Generator: an AI software with a synthetic data generator to validate AI models.
- Curiosity software synthetic test data: a test data automation solution to generate test data.
- ExactData: vendor of Smart Data, a solution to reduce cost and time to develop, test, deploy, train, and maintain data processing systems.
- GenRocket: vendor of enterprise synthetic test data solutions.
- iData: a data quality tool from IDS that incorporates data generation and obfuscation capabilities.
- Informatica: vendor of a test data solution with synthetic data capabilities.
- Synth: a tool from OpenQuery for generating realistic data using a declarative data model.
- TrialTwin: service provider for synthetic health data for patient and clinical trial analysis.
All of the companies above are commercial ones. But I want to point out that there is a great deal of research and resources available in the academic field and in the open-source community. These projects are freely accessible to anyone wishing to try their hand at synthetic data generation.
The Synthetic Data Vault lists numerous projects, libraries, and tutorials. Other open-source synthetic data tools and projects include:
- Smart noise: an open-source toolkit designed to be a layer between queries and data systems, relying on differential privacy.
- Twinify: a software package for privacy-preserving generation of a synthetic twin to a given sensitive data set.
- Synner: an open-source tool to generate real-looking synthetic data by visually specifying the properties of the dataset.
- Synthea: an open-source, synthetic patient generator that models the medical history of synthetic patients.
- Synthetig: an open-source platform where you can generate synthetic data.
For a full overview of the open-source ecosystem, check out part II of this other article.
2. Synthetic data providers for unstructured data
The companies listed below work with unstructured data, offering synthetic data products and services for training vision and recognition algorithms.
- AI Reverie: provider of synthetic simulated 3D environments.
- Anyverse: vendor of a synthetic data solution for perception models.
- Autonoma IA: provider of AI training data.
- Bifrost: provider of a synthetic data API that generates 3D worlds.
- CVEDIA: vendor of synthetic computer vision solutions for object recognition.
- Cognata: provider of simulations for ADAS and autonomous vehicle developers.
- Coohom Cloud: synthetic data service provider and developer of EUS, a synthetic data platform for training indoor agent cognition.
- Datagen: 3D simulated training data provider for Visual AI learning and development.
- Deep Vision Data: provider of synthetic training data for supervised and unsupervised training of machine learning systems.
- EdgeCase: synthetic data provider for AI & image recognition.
- Lexset: training data provider for computer vision systems.
- Mindtech: vendor of a synthetic data platform for training data creation for AI vision systems.
- Neurolabs: vendor of a synthetic data platform for Computer Vision.
- Neuromation: vendor of a distributed synthetic data platform for deep learning applications.
- OneView: vendor of a synthetic geospatial data generation and optimization platform.
- Parallel domain: vendor of a synthetic data platform for autonomous system training and testing use cases.
- Reinvent Systems: vendor of synthetic data solutions for image generation, text data, and 3D objects.
- Rendered AI: vendor of a common application framework to produce physics-based synthetic datasets for AI training and validation.
- Scale Synthetic: the synthetic data offering from Scale AI for ML model training.
- Simerse: seller of annotated, synthetic, labeled training sets for AI learning.
- Sky Engine: a platform for synthetic data generation to create data streams for deep learning in computer vision.
- Synthesis AI: vendor of a data generation platform for computer vision.
- Synthetaic: provider of AI training data.
- Synthetic Data Pty Ltd: vendor of simulation software that generates synthetic images.
- Synthetik: AI service provider with computer vision and machine learning expertise, including synthetic data generation.
- uSearch: a web search engine built entirely from AI-generated data.
- Vypno: vendor of a software solution to recognize and classify objects in images from drones, cameras, satellites or mobile phones.
- Yuva.ai: provider of synthetic image and video for computer vision.
- Zumo Labs: Training Data as a Service provider of synthetic training data for computer vision.
Market trends
As you can see from this list, we already have quite a few synthetic data vendors. The space is fast-growing. Out of the 58 companies listed here, 45% were created in the last two years. 2021 will most likely see its share of newcomers.
The available funding information reveals that at least* $78 million was invested in synthetic data companies in 2020, a 78% increase from 2019.
In 2018, synthetic data companies received at least* $62 million, with a large share of investments at the time going to synthetic data providers for computer vision. Last year, companies providing structured synthetic data received more funding.
In total, the synthetic data market has attracted over $210 million in investment in about five years. The data only includes investments made public with transaction details. Some investments, such as Statice’s collaboration with PwC Germany, are therefore not reflected in it.
*Data aggregated from publicly available information on Crunchbase, counting funding only for companies focused solely on synthetic data. See here for the data I use for the graphs.
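As a quick sanity check on the figures above, the arithmetic can be reproduced in a few lines of Python. The inputs are the numbers quoted in this post. The 2019 amount is not stated directly, so the value below is only what the 78% growth figure implies, and the variable names are mine.

```python
# Figures quoted in the post (funding in million USD, lower bounds).
funding_2020 = 78.0
growth_vs_2019 = 0.78   # "a 78% increase from 2019"
companies_listed = 58
share_recent = 0.45     # "45% were created in the last two years"

# Implied 2019 funding, from: funding_2020 = funding_2019 * (1 + growth).
implied_2019 = funding_2020 / (1 + growth_vs_2019)

# Approximate number of listed companies created in the last two years.
recent_companies = round(companies_listed * share_recent)

print(f"Implied 2019 funding: ~${implied_2019:.1f}M")                   # ~$43.8M
print(f"Companies created in the last two years: ~{recent_companies}")  # ~26
```

In other words, the stated growth rate implies roughly $44 million of disclosed funding in 2019, and about 26 of the 58 companies were founded in the last two years.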
Want to read more on synthetic data? Check out: