Everything that happened in the synthetic data space in 2022

Elise Devaux
11 min read · Oct 6, 2022


For an up-to-date directory of synthetic data solutions, check https://syntheticdata.carrd.co/

Over the past year, we saw notable growth in the synthetic data space and exciting market shifts. In this article, I’ve compiled updates from a year of market monitoring. Read about the new players, developments, and perspectives on how the ecosystem evolved.

Synthetic data: new players and market analysis

When I published the 2021 synthetic data landscape post, there were 67 vendors:

  • 28 structured synthetic data vendors,
  • 10 synthetic test data vendors,
  • 6 open source vendors,
  • and 29 unstructured data vendors.

One year later, we’re looking at a new picture:

We’re adding 31 vendors to the map, bringing it to a total of 100 companies commercializing synthetic data products and services. Five companies closed down, and I took the open-source solutions out of this map. Head to this article to browse the updated list of synthetic data companies.

In the structured data space, we have 18 new names if we count privacy-preserving and synthetic test data offerings together. These vendors are Aindo, Neutigers, Nuvanitic, Syntonym, Datacebo, Particle Health, Scale Synthetic, IvySys, Yet Analytics, DatProf, Esito, Accelario, Validata, Avo Automation, Broadcom, Smart Data Foundry, Clearbox AI, and Bulian AI.

The number of synthetic test data vendors has been booming. However, companies use the term “synthetic data” loosely: some vendors offer rule-based simulated data (fake data), while others offer AI-generated synthetic data. In either case, the technology is on the rise, helped by the lighter privacy constraints that apply to fake data and by existing open-source building blocks that streamline the development of new capabilities. As a result, DataOps, test data, and data automation software vendors are now adding synthetic or fake data capabilities to their offering, doubling down on the privacy argument for test data. The sketch below illustrates the difference between the two approaches.
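
To make the distinction concrete, here is a minimal sketch in Python. It contrasts rule-based fake data (drawn from generators such as the open-source `faker` package) with a deliberately simplified stand-in for AI-generated synthetic data (a statistical model fitted on real records, then sampled). The column names and the Gaussian model are illustrative assumptions, not any vendor’s actual method.

```python
import numpy as np
import pandas as pd
from faker import Faker  # open-source, rule-based generator library

fake = Faker()

# 1) Rule-based fake data: values come from generators and rules,
#    no real data is needed (and no privacy is at stake).
fake_customers = pd.DataFrame({
    "name": [fake.name() for _ in range(5)],
    "email": [fake.email() for _ in range(5)],
    "age": np.random.randint(18, 80, size=5),
})

# 2) AI-generated synthetic data, caricatured here as a fitted statistical
#    model: learn a joint distribution from real records, then sample new ones.
real = pd.DataFrame({
    "income": np.random.lognormal(mean=10, sigma=0.5, size=1_000),  # stand-in for real data
    "age": np.random.normal(loc=45, scale=12, size=1_000),
})
mean_vec, cov_mat = real.mean().values, real.cov().values   # "fit" step
synthetic = pd.DataFrame(
    np.random.multivariate_normal(mean_vec, cov_mat, size=1_000),  # "sample" step
    columns=real.columns,
)

print(fake_customers.head())
print(synthetic.describe())
```

In practice, vendors replace the toy Gaussian model with deep generative models (GANs, VAEs, or copula-based approaches), but the fit-then-sample pattern is the same.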

Additions to the list in the unstructured data space include Infinity AI, vAIsual, Mirage Vision, Omniverse™ Replicator, Scale Synthetic, Datagrid, Kroop AI, Indika AI, CNAI, Deci, Alethea AI, Syntric AI, SBX Robotics.

Bitext appeared in the less dynamic area of synthetic text data, and Deepsync in the synthetic audio space.

We continue to witness strong development of unstructured synthetic data solutions. A few characteristics explain the faster growth of this side of the market:

  • the maturity of use cases such as computer vision training,
  • the availability of supporting technology such as image modeling software and game engines,
  • and the greater adoption of the technology in fast-growing industries such as automotive, retail, or video games.

Finally, the open-source landscape expanded. We recently surveyed the open-source ecosystem, and the list now counts 20 tools.

Strategic developments

The market shifts were the most exciting part of this year’s developments. Below are six trends observed over the last few months in the synthetic data space:

  • Funding: investments have been a key element in the development of companies in recent months, amounting to at least $328 million in the last 18 months.
  • Pure-player specialization: after a fairly homogeneous market in the first years, vendors are accentuating their differentiation with positioning choices on use cases or industries.
  • Moving away from a pure product industry: services and new business models are arriving in an industry that has so far been very product-focused.
  • In-house development: synthetic data is no longer the prerogative of pure players. In recent months, large companies have announced the development of in-house synthetic data capabilities.
  • Key partnerships: in parallel to in-house development of synthetic data capabilities, other large companies have taken the path of partnerships, signing deals between pure players and big tech or specialized companies.
  • Ecosystem development: regulators, public institutions, market analysts, and communities are increasingly engaging with synthetic data.

Funding

The total of publicly known funding for synthetic data companies reached $328 million in the last 18 months, $275 million more than in 2020, raised across both structured and unstructured synthetic data vendors.

Pure-players specialization

There were two significant shifts for structured synthetic data vendors: industry specialization and use-case broadening with synthetic data for testing. Regarding industry specialization, the following companies specialized or positioned themselves in a market niche:

  • FinCrime Dynamics presented Synthesizer®, a tool designed for the finance industry and fraud detection use cases.
  • Nuvanitic launched Nuvanitic IntelliHealth™, a solution for the pharma industry specializing in synthetic clinical trial data.
  • vAIsual announced its synthetic data offering for the B2B IP licensing market.
  • IvySys launched a synthetic data generation tool for synthetic threat transactions.
  • Smart Data Foundry researches synthetic financial datasets for fraud fighting, working with financial institutions in the UK.

Additionally, historical pure players in the synthetic data space broadened their use cases from privacy to test data generation, closing the gap with the vendors historically focused on synthetic test data.

The traction for this market segment, where synthetic data is sold as an alternative to test datasets or real data in testing environments, is growing too, as evidenced by the many references to synthetic data in job offers for QA engineers.

On the unstructured data side, the focus on applications is also evolving. As pointed out by a Reddit user, it’s slowly moving from the rather popular area of domain randomization, which supports the creation of multiple variations of scenarios or images, to applications that build more realistic-looking images, like synthetic brain imagery.
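
For readers less familiar with the term, domain randomization simply means sampling many random variations of a scene’s parameters (lighting, pose, textures, camera position) so that a model trained on the rendered images generalizes to the real world. Below is a minimal, hypothetical sketch of such a parameter sampler in Python; the parameter names and ranges are assumptions for illustration, not any vendor’s actual pipeline.

```python
import random

def sample_scene_parameters():
    """Draw one randomized scene configuration (hypothetical parameters)."""
    return {
        "light_intensity": random.uniform(0.2, 1.5),       # vary lighting strength
        "light_azimuth_deg": random.uniform(0.0, 360.0),    # vary light direction
        "object_yaw_deg": random.uniform(0.0, 360.0),       # vary object pose
        "camera_distance_m": random.uniform(0.5, 3.0),      # vary viewpoint
        "background_texture": random.choice(["wood", "concrete", "fabric"]),
    }

# A renderer (a game engine, Omniverse Replicator, etc.) would consume each
# configuration to produce one labeled image; thousands of draws yield a
# varied training set. The shift described above is toward spending that
# rendering budget on realism rather than on ever more randomization.
if __name__ == "__main__":
    for scene in (sample_scene_parameters() for _ in range(3)):
        print(scene)
```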

Services and new business models

While most companies have been developing software solutions since 2018–2019, we are starting to see services and marketplaces around synthetic data arrive on the market. APIs, marketplaces, and self-service synthetic data offerings might be ideal for more ad hoc needs and could accelerate technology adoption through streamlined testing processes.

Some companies have banked on simplified, quick access to synthetic data tools to boost adoption. As a result, a couple of freemium and self-service models have sprung up.

Interestingly, when it comes to self-service structured synthetic data, offerings that require data submission face the same data processing constraints that drove people to privacy-preserving synthetic data in the first place. To use personal data for such on-demand services, customers must have a legal basis, such as consent from the data subjects. As a possible answer, we’ve seen combinations of Privacy-Enhancing Technologies (PETs) being developed.

While dataset sales were already common among unstructured synthetic data vendors, structured players also got into synthetic dataset sales; for example, GeoTwin offers synthetic population datasets. With a different access model but a similar intention, synthetic data APIs and marketplaces have also emerged in the past months.

Just as the lack of data is a challenge for many companies, the lack of training data from which to generate synthetic data is also problematic. Today, there is clear added value in offering synthetic data pools that combine the original data of several enterprises. However, even though the first APIs and marketplaces are arriving, legal, corporate, and technical barriers keep these offerings rare, especially in Europe.

In-house development

There have been two types of development for synthetic data capabilities. On one side, privacy software vendors focusing on other protection techniques are now adding synthetic data capabilities to their toolbox to broaden the privacy technology offering for their customers.

On the other side, big tech and large companies are looking to develop their own synthetic data capabilities. When they develop structured synthetic data tools, it is usually to improve data access and flow within departments and with their partners. For unstructured data, most cases have been about supporting the development of machine learning models.

Large companies have already announced that they are using or developing the technology, or have shown intentions to do so. Compared to last year, many more big players have taken this route, notably for unstructured data capability development.

There were fewer communications about in-house development of structured synthetic data capabilities. One reason could be the complexity and lower maturity of the technology, which, as we’ll see in the next section, is compensated for by more partnerships.

Many big tech companies’ research departments have been actively hiring synthetic data engineers or privacy professionals in recent months, notably Apple’s Synthetic Data Group, TikTok’s Privacy Innovation (PI) Lab, and Mastercard.

Key partnerships and acquisitions

Finally, the last months saw a multitude of partnerships between synthetic data vendors and big tech or specialized companies. Where companies couldn’t (or wouldn’t) build, they bought or signed deals: the market is consolidating, with several deals in specialized industries.

Ecosystem development

A few signs are worth noting in the ecosystem’s development. The technology is receiving increasing interest from regulators, who have become aware of the need to set up a legal framework around it. Until now, no European actor has taken the initiative to produce recommendations on using these technologies for data protection purposes. However, several of them have launched research to improve their understanding of the market and of companies’ needs. The UK’s ICO, for instance, launched a consultation on the draft of its guidance on anonymization, pseudonymization, and PETs.

Public institutions are also showing increasing interest, with calls for projects and public consultations. For example, the UK’s financial authority issued a call for input on synthetic data on 30 March. The EDPS undertook to monitor the development of these technologies through its TechSonar initiative, which includes a track for synthetic data. In Asia, the Hong Kong Monetary Authority launched a RegTech lab in November 2021 to investigate, among other things, the use of synthetic data for Anti-Money Laundering.

Market analyses are flourishing everywhere, with research from major groups such as Gartner, Forrester, or CB Insights at the top of the list, announcing wide adoption of synthetic data in the years to come. Prophetic announcements or reality? In any case, these firms contribute significantly to influencing buyers, and their positions impact the development of the synthetic data market.

Finally, communities are gradually emerging, gravitating around the open-source space, where people seek support in developing their tools. For example, the Synthetic Data Vault Slack space now counts 700 members, and the OpenSDP community was created to share educational analytics tools and resources around OpenSDPsynthR.

But synthetic data communities don’t stem only from the open-source world; vendors are also trying to generate momentum through communities that will support broader market education efforts.

From an insider perspective, the market started to reshape over the last few years, and we now see strategic trends emerge. Startups and products keep coming as mainstream news outlets share predictions about a synthetic future, and big tech companies send maturity signals through in-house development and partnerships. There is, of course, much more to unpack: the compliance challenges behind synthetic data, use cases and customer adoption, or the current technical limitations. But I’ll keep that for the next post.

Let me know your thoughts about this article. Have you noticed other developments? Do you think the future is synthetic? Hit me up here or on LinkedIn and Twitter with your comments.

Elise Devaux

Personal blog of a tech enthusiast and digital marketer interested in synthetic data, data privacy, and climate tech. Currently working at cozero.io.