What is Synthetic Data?

Synthetic data is artificially generated information created by computer algorithms or simulations, commonly used as an alternative to real-world data for purposes including research, development, and testing, while addressing privacy and ethical considerations.

Synthetic Data Definition

Synthetic data refers to artificially generated data that mimics the characteristics of real-world data but does not contain information from actual individuals or events. What is synthetic data generation? It is the process by which synthetic data is created through various statistical methods, algorithms, or models to simulate the patterns, distributions, and structures found in real-world data. The primary purpose of synthetic data is to provide a substitute for real data when privacy concerns, data sensitivity, or legal restrictions make it challenging to use actual data.

Synthetic data can be used in various fields, including synthetic data for machine learning, data analysis, and software testing. Machine learning synthetic data, for example, can be employed to train models when obtaining sufficient real-world data is difficult or impractical. Additionally, it is vital for testing and validating algorithms without compromising the privacy of individuals or violating data protection regulations.

By using synthetic test data, researchers, data scientists, and developers can explore and test algorithms, models, and systems without exposing sensitive information or violating privacy regulations. It provides a way to overcome challenges related to data availability, especially in situations where obtaining sufficient real data may be difficult or impractical.

Creating synthetic data involves understanding the statistical properties and relationships present in real-world data and replicating them in a way that preserves the essential characteristics without revealing sensitive information.

Synthetic data promises instant access to an unlimited supply of high-fidelity data that’s statistically accurate, privacy-protected and safe to share.

In this comprehensive overview, we’ll cover what synthetic data is, how to generate synthetic data, its various types and benefits, and why developers, data scientists and enterprise teams across industries are eager to use it.

Synthetic Data FAQs

Synthetic Data: Artificial Data, Actual Events

As mentioned above, synthetic data is commonly used as an alternative to real-world data. More specifically, it is artificially generated information created by computer algorithms or simulations. Research has shown that synthetic data for AI can be as good as or even better than real-world data for data analysis and training models, and that it can be engineered to reduce biases in datasets and protect the privacy of any personal data it is trained on. With the right synthetic data generation tools, synthetic data is also quick to produce with an AI-powered synthetic data generator, making it a fast, cost-effective data augmentation technique as well.

The privacy bottleneck

One of the biggest bottlenecks to innovation that developers and data scientists face today is getting access to data, or creating the data needed to test an idea or build a new product. In fact, in a recent Kaggle survey, 20,000 data scientists listed the “data gathering” stage as the single most time-consuming part of a typical project, accounting on average for 35% of the total work.

Time spent on different aspects of a typical data science project. Source: Kaggle user survey

Synthetic data offers faster access to sensitive data

From our experience working at AWS, Google, OpenAI, and with other leaders in the data industry, we know firsthand that enabling developers to safely learn and experiment with data is the key to rapid innovation. Developers and data scientists don't always need - or even want - access to sensitive or personally identifiable information (PII). That's where synthetic data comes in.

Synthetic data opens up the possibilities of enabling access to artificial and privacy-preserving versions of personal data in minutes, with 95% of the accuracy of the real-world data it was trained on, and without having to wait weeks for manual anonymization and approvals. 

Synthetic data can augment limited datasets

A second bottleneck is not having enough of the right data. Whether you are a developer trying to program a chatbot to respond better to a rare question, or a medical researcher collecting sensor data from medical devices, data can be your most valuable asset, and it may be prohibitively expensive or even impossible to collect more.

Synthetic data offers a unique solution to this problem. Modern synthetic data generation uses deep learning models such as language models (LMs) and generative adversarial networks (GANs), which learn to create new, artificial datasets with the same size and distributions as the data they were trained on. Using techniques such as conditional data generation, these models can be tuned to generate records with particular attributes - for example, augmenting machine learning training sets to help algorithms better detect heart disease in female patients.
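
To illustrate the idea behind conditional generation, here is a minimal sketch that uses a toy per-group Gaussian model in place of a production GAN or language model. The dataset, file name, and column names are hypothetical:

```python
import numpy as np
import pandas as pd

def fit_conditional_model(df: pd.DataFrame, condition_col: str) -> dict:
    """Fit a per-group Gaussian over the numeric features.
    A toy stand-in for a conditional GAN or language model."""
    numeric = df.select_dtypes("number").columns.drop(condition_col, errors="ignore")
    model = {}
    for value, group in df.groupby(condition_col):
        feats = group[numeric].to_numpy()
        model[value] = (feats.mean(axis=0), np.cov(feats, rowvar=False), list(numeric))
    return model

def generate(model: dict, value, n_rows: int, seed: int = 42) -> pd.DataFrame:
    """Sample synthetic records conditioned on one attribute value."""
    mean, cov, cols = model[value]
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.multivariate_normal(mean, cov, size=n_rows), columns=cols)

# Hypothetical usage: oversample an under-represented patient group.
# patients = pd.read_csv("patients.csv")            # includes a 'sex' column
# model = fit_conditional_model(patients, "sex")
# extra_records = generate(model, value="female", n_rows=5_000)
```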

What are the trends driving synthetic data use?

Privacy by design, and doing more with less…

In the visual below, Gartner shared a prediction that by 2030 the use of synthetic data will overshadow real data in AI models. From what we have heard in customer conversations, this is already happening today, driven by several key trends:

  • increased compliance costs and regulatory risks under new data protection regimes like GDPR and CCPA
  • record-high cybersecurity attacks and data breaches
  • privacy concerns related to training algorithms on customer data
  • the high cost of manually annotating data to train machine learning algorithms
  • more processing happening directly on devices, requiring less data to be sent to the cloud

Gartner synthetic data prediction

Why is Synthetic Data Important? 

At Gretel, we've seen three core use cases for synthetic data that benefit enterprises, developers, researchers, and applied scientists alike:

  1. Making private data accessible and shareable – allows for faster, safer collaboration with data that can accelerate time to value and unlock innovation.
  2. Generating more samples from limited datasets – provides a higher volume of data for applications such as testing AI/ML models in pre-production and production-grade training.
  3. Reducing bias in machine learning datasets – enables the creation of more representative and inclusive datasets that can better predict trends, for example, diagnosing heart disease in women.

Synthetic Data Informs AI/ML Models

Machine learning algorithms need lots of examples to perform well, and they generalize poorly to data outside what they were trained on. Consider what it would be like to train a voice assistant such as Alexa to recognize new commands from users. For each new utterance, there are thousands of possible variations of speech, semantics, and slang that the algorithm needs to understand. Traditionally, these algorithms have been trained on large amounts of real-world data that has been manually collected and annotated.

What if new phrases and combinations could be created automatically by a developer from just a few examples—to include variations in speech, background noises such as dogs barking, or even multiple voices talking at once? The power to create nearly infinite variations of new data is one of the most compelling reasons to generate synthetic data for machine learning.

Diverse Use Cases Across Multiple Industries and Business Functions

Being able to learn from and build with data is one of the fastest contributors to innovation today. As such, there are applications for synthetic data across industries. Here are some common synthetic data use cases:

  • Automotive and Robotics — leveraging synthetic data to create simulated environments for training robots, self-driving car software, and even testing safety and crash prevention technologies.
  • Cybersecurity and Infosec — using synthetic data to train machine learning models to better detect rare events such as fraud and cyber attacks, as well as more advanced techniques for defending models against adversarial attacks.
  • Education — using synthetic student records, exam results, and course enrollment data. This type of data can be used for testing educational software and analytics systems while protecting student privacy.
  • Financial Services — creating synthetic time-series data to enable data sharing that doesn't compromise customers' privacy, and to create examples of rare events and anomalies (such as the GameStop short squeeze in 2021) to train algorithms to better respond to market events.
  • Gaming and Metaverse — using synthetic data to safely capture and study new forms of user data such as biometrics, heart rates, and eye movements.
  • Geospatial Data — using synthetic aperture radar (SAR) data, geographical data points, routes, and mapping information. Synthetic geospatial data is valuable for testing navigation systems, geographic information systems (GIS), and location-based services.
  • Manufacturing — using synthetic data to simulate complex supply chain operations and predict where failures may occur. Synthetic data in manufacturing can be applied for testing and optimizing production processes and predictive maintenance algorithms.
  • Retail — using synthetic data to simulate options for laying out products in a store and how customers move through aisles. Synthetic retail data can also be employed for testing e-commerce platforms, recommendation systems, and inventory management systems.
  • Social Media — using synthetic social media posts, user profiles, and interactions. This type of data can be used for training sentiment analysis models and studying user behavior without using real social media content or real customer data.
  • Synthetic Data for Computer Vision — using synthetic images of objects, scenes, or people. Generated images can be used for training image recognition algorithms and testing computer vision applications.
  • Synthetic Data in Healthcare and Life Sciences — creating synthetic genomic data to fuel medical breakthroughs and encourage better medical care, as well as potentially create new revenue streams for hospitals and other medical institutions.
  • Text and Natural Language Processing (NLP) — creating artificial text documents, articles, or chat logs. Synthetic text data is useful for training NLP models, sentiment analysis, and language understanding without using real textual content.

Additionally, synthetic data can be used to accelerate different business functions such as: 

  • Marketing — enable wider access to marketing data by creating synthetic and privacy-preserving versions of sensitive user behavioral and transactional data.
  • Machine Learning — research has shown that augmenting machine learning datasets with synthetic data can lead to both better accuracy than using real-world-only data, and increased algorithmic fairness.
  • Agile Development and Operations (e.g. DevOps, MLOps) — synthetic and de-identified data is being used by companies today to power pre-production and testing environments with all of the dynamism of real-world data and none of the privacy risks.
  • Human Resources — enable wider access to HR data by creating synthetic and privacy-preserving versions of sensitive personnel records.

How Accurate Is Synthetic Data vs. Real Data?

Despite terms often used to describe synthetic data, such as "fake data" or "mock data", quality synthetic data can be nearly as accurate as the real data it is based on, since we generate synthetic data from real data. In some cases synthetic data can even improve on the accuracy of real-world data. This is possible because synthetic data models can generate many more samples from the training data, which can help downstream machine learning algorithms generalize better. In fact, in testing Gretel's synthetic data accuracy against the top 8 machine learning datasets on the data science platform Kaggle, synthetic data often outperformed real-world data for downstream classification tasks, with a mean accuracy within 1% of their real-world equivalents. Even with Gretel's advanced privacy features enabled on "high"—which prevents overfitting, detects dangerous similarities and outliers, and adds noise (differential privacy) to the dataset—the mean synthetic data accuracy was within 2% of real-world accuracy.

Comparing synthetic data & real-world data accuracy

As mentioned above, in some cases synthetic data can even outperform its real-world counterparts. Consider the example below on the UCI Credit Card Default dataset, where models trained on synthetic data outperformed their real-world counterparts with an average accuracy of 78.39% vs. 76.56% (+1.83%) in predicting whether a person would default on credit card payments, across machine learning classifiers including Random Forest, XGBoost, and SVM.

Synthetic data performance on UCI Credit Card Default dataset
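
The comparison above follows a train-on-synthetic, test-on-real pattern. Below is a minimal sketch of that evaluation loop; the file names and target column are placeholders, numeric features are assumed, and Gretel's benchmark methodology may differ in its details:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder file names; any real/synthetic pair with the same schema works.
real = pd.read_csv("real.csv")
synthetic = pd.read_csv("synthetic.csv")
TARGET = "default_payment"  # hypothetical target column

# Hold out a slice of *real* data that neither training set contains.
train_real, test_real = train_test_split(real, test_size=0.2, random_state=0)

def evaluate(train_df: pd.DataFrame, test_df: pd.DataFrame) -> float:
    """Train a classifier on train_df and score it on held-out real data."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(train_df.drop(columns=[TARGET]), train_df[TARGET])
    return accuracy_score(test_df[TARGET], clf.predict(test_df.drop(columns=[TARGET])))

print("train on real:     ", evaluate(train_real, test_real))
print("train on synthetic:", evaluate(synthetic, test_real))
```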

Determining Synthetic Data Quality

There are three core indicators that, when combined and weighted accordingly, can help assess the quality of a synthetic dataset and the use cases, if any, it might be suitable for: field correlation stability, deep structure stability, and field distribution stability. All three metrics are part of Gretel's Synthetic Quality Score (SQS); scores range from 0-100 and are provided every time a user generates a synthetic model. We'll walk through each metric using screenshots from a typical SQS report.

1. Field Correlation Stability – the correlation between every pair of fields in the real-world data vs. synthetic data. To aid in the comparison of field correlations, this metric is shown as a heatmap in the synthetic quality report showing the computed difference of correlation values.

Field Correlation Stability heatmap
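
One way to approximate this metric, assuming two dataframes with matching columns and a recent version of pandas (Gretel's exact computation may differ):

```python
import pandas as pd

def correlation_difference(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Absolute difference between every pairwise field correlation in the
    real vs. synthetic data; values near zero mean correlations were preserved."""
    return (real.corr(numeric_only=True) - synthetic.corr(numeric_only=True)).abs()

# The resulting matrix can be rendered as a heatmap, e.g. seaborn.heatmap(diff).
```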

2. Deep Structure Stability – To verify the statistical integrity of deeper, multi-field distributions and correlations, Gretel compares a Principal Component Analysis (PCA) computed first on the original data and then on the synthetic data. Look for a similar size and distribution of elements to verify that the model has not overfit the training data.

PCA comparison of original training data and synthetic data
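
A minimal sketch of this comparison using scikit-learn; `real_df` and `synth_df` are assumed to be already-loaded dataframes:

```python
import pandas as pd
from sklearn.decomposition import PCA

def pca_projection(df: pd.DataFrame, n_components: int = 2):
    """Project the numeric fields of a dataset onto its top principal components."""
    return PCA(n_components=n_components).fit_transform(df.select_dtypes("number"))

# Scatter-plot both projections; a similar size and spread of points suggests
# the synthetic model captured the deeper multi-field structure.
# real_2d  = pca_projection(real_df)
# synth_2d = pca_projection(synth_df)
```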

3. Field Distribution Stability – a measure of how closely the field distributions in the synthetic data mirror those in the original data. For each field, we use a common approach for comparing two distributions, the Jensen-Shannon distance.

Field distribution comparison using Jensen-Shannon distance
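
For numeric fields, the distance can be computed roughly as follows; this is a sketch using SciPy, and the binning choices are an assumption:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def field_js_distance(real_col, synth_col, bins: int = 20) -> float:
    """Jensen-Shannon distance between one field's real and synthetic
    distributions: 0 means identical, values near 1 mean very different."""
    lo = min(real_col.min(), synth_col.min())
    hi = max(real_col.max(), synth_col.max())
    p, _ = np.histogram(real_col, bins=bins, range=(lo, hi))
    q, _ = np.histogram(synth_col, bins=bins, range=(lo, hi))
    return float(jensenshannon(p, q))  # jensenshannon normalizes the histograms
```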

Understanding a synthetic data model’s quality is about determining its accuracy and the level of privacy it provides. This is vital information when the synthetic data will be used to answer critical questions such as “Is this patient at risk for a stroke?” 

Gretel's Synthetic Data Quality Report provides an overall quality score (from 0-100), a summary table of the viable use cases for your data, and a more detailed analysis of how your synthetic data generation process went.

Overview of Gretel's Synthetic Quality Report (SQR)
SQR summary of viable use cases for your synthetic data

For a deeper dive on quantifying the accuracy of synthetic data for downstream tasks such as business analytics, forecasting, and training machine learning models, check out Gretel’s case study with one of the largest financial institutions in the world, where they successfully created a process to enable internal data sharing of synthetic data to replace a legacy and error-prone process of manually de-identifying and redacting datasets.

Synthetic Data vs Data Masking

Synthetic data involves the creation of entirely new data, while data masking focuses on protecting sensitive information within existing real datasets. Both techniques contribute to privacy preservation but are suited for different scenarios and use cases.

Data masking involves applying techniques such as encryption, tokenization, or other methods to replace or modify sensitive information in the original dataset. The goal is to make it difficult to link the masked data back to the actual individuals. Data masking is commonly used when working with real data for various purposes, such as sharing datasets for analysis or testing, while ensuring that personally identifiable information (PII) remains protected.
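
As a concrete illustration, here is a minimal tokenization-style masking sketch; the dataset and column names are hypothetical, and production masking tools add capabilities like format preservation, key management, and reversibility:

```python
import hashlib
import pandas as pd

def mask_column(df: pd.DataFrame, column: str, salt: str) -> pd.DataFrame:
    """Replace a sensitive field with a salted hash token, a minimal
    illustration of tokenization. The original structure of the dataset
    is preserved while the values themselves become unlinkable."""
    out = df.copy()
    out[column] = out[column].astype(str).map(
        lambda v: hashlib.sha256((salt + v).encode()).hexdigest()[:12]
    )
    return out

# Hypothetical usage:
# customers = pd.read_csv("customers.csv")
# masked = mask_column(customers, "email", salt="rotate-me")
```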

Here are some key differences between the two techniques:

Nature of Data:

  • Synthetic data involves creating entirely new data that does not have a one-to-one correspondence with any specific real individual or event.
  • Data masking involves working with real data but applying transformations to conceal or modify specific details.

Use Cases:

  • Synthetic data is often used in situations where obtaining real data is challenging, and the focus is on creating a substitute for training models or conducting analyses.
  • Data masking is applied when working with real data is necessary, but the goal is to protect sensitive information, allowing the use of the data for certain purposes without compromising privacy.

Approach to Privacy:

  • Synthetic data provides a privacy-by-design approach, as the generated data is entirely artificial and free from real-world identifiers.
  • Data masking focuses on obscuring or transforming sensitive elements within real data, preserving the original structure but making it more privacy-compliant.

What are the Benefits of Synthetic Data?

Synthetic data offers several benefits across various industries and applications, making it a valuable tool for researchers, data scientists, and developers. Here are some key advantages of using synthetic data:

  • Privacy Preservation: Synthetic data allows for the creation of artificial datasets that do not contain any real information about individuals. This helps to address privacy concerns associated with using sensitive or personally identifiable information in testing, research, or development.
  • Compliance with Data Regulations: Synthetic data enables organizations to comply with data protection regulations and privacy laws by avoiding the use of actual personal data in scenarios where compliance is challenging.
  • Data Security: Since synthetic data is entirely artificial, there is no risk of exposing real and sensitive information during testing or development processes. This enhances data security and reduces the potential for data breaches.
  • Data Diversity: Synthetic data can be generated to represent a diverse range of scenarios, including edge cases or unusual situations. This helps to ensure that machine learning models or algorithms are exposed to a broader spectrum of possibilities during training.
  • Availability and Accessibility: In situations where obtaining real data is difficult, expensive, or restricted, synthetic data provides a readily available alternative. This is particularly beneficial for research and development in niche or emerging fields.
  • Synthetic Data Testing and Validation: Synthetic data allows for thorough testing and validation of algorithms, models, or systems without the need for real-world data. This can help identify and address issues in a controlled environment.
  • Overcoming Data Scarcity: In domains where data scarcity is a challenge, such as certain medical or rare event scenarios, synthetic data can be generated to supplement limited real datasets and improve the robustness of models.
  • Reduced Bias and Fairness Concerns: Synthetic data generation provides an opportunity to reduce biases present in real data, leading to fairer models. By carefully designing synthetic data sets, developers can control and mitigate biases that might exist in the original data.
  • Resource Efficiency: Synthetic data generation can be a more resource-efficient approach compared to collecting, cleaning, and managing large volumes of real data. It allows organizations to achieve similar testing or training objectives with less effort and cost.
  • Innovative Research: Researchers can use synthetic data to explore new ideas, test hypotheses, and conduct experiments without the constraints or ethical considerations associated with real data.

Additionally, businesses can use synthetic data for various direct benefits: 

  • helps businesses streamline access to data for users up to 10x faster than traditional approaches, and with reduced privacy risks
  • can be used to simulate not yet encountered conditions
  • offers protections against privacy attacks including re-identification and data linkage attacks, as none of the records generated by the synthetic model are based on any single individual
  • can offer mathematical privacy guarantees to enable sharing of sensitive data when combined with formal privacy techniques such as differential privacy
  • can be generated at a fraction of the cost of current approaches for annotating datasets to train machine learning algorithms
  • advanced synthetic data models can preserve the multivariate relationships between variables better than statistical approaches

What are Synthetic Data Challenges?

Some of the challenges with synthetic data include:

  • Accuracy trade-offs — synthetic data is only as good as the model that generated it, and a poorly trained model can miss important correlations or edge cases present in the original data.
  • Privacy leakage — without safeguards such as overfitting prevention and differential privacy, generative models can memorize and replay records from their training data.
  • Validation — assessing whether a synthetic dataset is faithful enough for a given downstream task requires dedicated quality metrics, such as the SQS metrics described above.
  • Bias propagation — models trained on biased data can reproduce or amplify those biases unless datasets are deliberately engineered to correct them.

Types of Synthetic Data

The main categories of synthetic data include natural language text, tabular, and time-series data. Additional categories include image, audio, video, and simulation data.

Tabular data

Tabular data, including tables, databases, and machine learning training sets, is a very common use case. A table is an arrangement of information or data, typically in rows and columns, or possibly in a more complex structure. Popular formats for tabular data include comma-separated values (CSV), JavaScript Object Notation (JSON), and big data formats such as Parquet. Synthetic data models such as language models (LMs) and generative adversarial networks (GANs) are effective at learning and recreating these kinds of datasets.

Time series

Time series data is often tabular data, such as sensor data or financial data, that has a time component that must be learned and maintained across multiple rows. Consider a heart rate sensor on a fitness watch, where each row might be a point in time, and a user's heart rate should increase gradually as they increase running speed or start going uphill. This requires the synthetic data model to maintain a level of state or memory as it generates datasets, so that trends and seasonality are effectively recreated in the synthetic data. Synthetic data models such as GANs have proven particularly effective at recreating these kinds of time-series components.
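
One simple way to check whether that temporal "memory" was preserved is to compare autocorrelation profiles of the real and synthetic series. This sketch and its variable names are illustrative:

```python
import pandas as pd

def autocorrelation_profile(series: pd.Series, max_lag: int = 24) -> pd.Series:
    """Autocorrelation at increasing lags, i.e. the temporal 'memory'
    a synthetic time-series model needs to reproduce."""
    return pd.Series({lag: series.autocorr(lag) for lag in range(1, max_lag + 1)})

# Hypothetical usage with real and synthetic heart-rate streams:
# real_profile  = autocorrelation_profile(real_hr)
# synth_profile = autocorrelation_profile(synth_hr)
# Large gaps between the two profiles mean trends and seasonality
# were not preserved by the synthetic model.
```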

Text

Text or natural language data is another popular use case for synthetic data. Being able to recreate convincing natural language from models trained on just a few examples is valuable for developers and data scientists building chatbots, voice assistants, and a multitude of other applications. Language models such as Transformers were developed specifically for this task by adding an attention mechanism, with examples such as GPT-3 (a model with 175 billion parameters trained on roughly 45 TB of text) able to generate convincing natural language given only a few prompts.
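
As an illustration, here is a minimal prompt-based generation sketch using the Hugging Face `transformers` library and the small, openly available GPT-2 model (GPT-3 itself is accessible only through an API); the prompt is hypothetical:

```python
from transformers import pipeline

# Load a small, openly available language model for local experimentation.
generator = pipeline("text-generation", model="gpt2")

# Seed the model with a few example utterances to elicit new variations.
prompt = (
    "Ways a user might ask a voice assistant for the weather:\n"
    "- What's the weather like today?\n"
    "- Will it rain this afternoon?\n"
    "- "
)
for sample in generator(prompt, max_new_tokens=30, num_return_sequences=3, do_sample=True):
    print(sample["generated_text"])
```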

Images and Video

Generating realistic images, including human faces, has been a popular research task in recent years and has led to tremendous advances in GANs. Examples such as Nvidia's StyleGAN, CycleGAN, and now FastCUT have shown that it is possible to synthesize high-quality images based on only labels and publicly available image datasets. For more information, read our blog on synthesizing user location data on public maps using FastCUT.

Simulation

Synthetic data is one of the hottest use cases in the simulation space, where 3D simulations built on computer game engines (see Unity3D's Perception) can be used to train robots, self-driving cars, and other autonomous systems to navigate real-world situations at a scale and speed that is not possible with real-world testing. For additional synthetic data examples, check out Facebook's AI Habitat and OpenAI's Gym.

Audio

Technologies for audio and speech synthesis are used heavily in music and in the voice assistants many of us use every day, such as Siri, Alexa, and Google Assistant. The need to train automatic speech recognition (ASR) algorithms to recognize different kinds of voices, dialects, and speech patterns has also created the need to synthesize voices and variations of terms, as collecting data from the real world or having speech generated and annotated manually is prohibitively costly.

What is Synthetic Data Software?

How is synthetic data generated? One common method is using synthetic data software provided by synthetic data companies. Synthetic data software refers to tools and platforms designed to generate artificial or synthetic datasets for various purposes. These tools and platforms use algorithms, statistical models, and other generation techniques to create data that closely mimics the characteristics of real-world data without containing any actual information about individuals or events.

Synthetic data generation software is often used in situations where privacy concerns, data sensitivity, or legal restrictions make it challenging to use real data for testing, development, or research purposes.

Key features and functionalities of synthetic data software may include:

  • Data Generation Algorithms: Utilize algorithms to create synthetic data that replicates the statistical properties, distributions, and relationships present in real data.
  • Customization Options: Allow users to customize the processes for generating synthetic data based on specific requirements, such as data types, structures, and desired characteristics.
  • Privacy-Preserving Techniques: Implement privacy-preserving methods to ensure that the generated synthetic data does not reveal sensitive or personally identifiable information.
  • Data Validation and Quality Assurance: Include tools for validating and assessing the quality of the synthetic data, ensuring that it accurately represents the features of the real data.
  • Support for Various Data Types: Offer support for generating synthetic data across different data types, including numerical, categorical, textual, and spatial data.
  • Integration with Machine Learning Frameworks: Provide integration with popular machine learning frameworks, allowing users to apply synthetic data to deep learning and to training and testing machine learning models.
  • Scalability: Some synthetic data solutions are designed to scale efficiently, enabling the generation of large datasets for diverse applications.
  • Rule-Based Generation: Incorporate rule-based generation capabilities, allowing users to define specific rules and constraints for generating synthetic data.
  • Data Masking and Anonymization: In some cases, synthetic data providers may include features for data masking or data anonymization, allowing users to protect sensitive information in real datasets.

What is Multimodal Synthetic Data Generation?

In the context of synthetic data, "multimodal" refers to the integration of multiple modes or types of data into the synthetic dataset. Modalities can include different types of information such as text, images, audio, or other forms of data. Multimodal synthetic data generation aims to replicate the diversity and complexity found in real-world datasets by incorporating various data modalities.

For example, a synthetic dataset for autonomous vehicle training might be multimodal, including simulated images, sensor data, and corresponding textual descriptions. This approach allows machine learning models to learn from a broader range of data sources, enhancing their ability to generalize to different types of information encountered in real-world scenarios.

The use of multimodal synthetic data is particularly valuable when training models for tasks that involve multiple types of input or output, enabling more robust and realistic training experiences for artificial intelligence systems.

Gretel’s Advanced Synthetic Data Privacy Guarantees

In conclusion, whether you need to access or share private data, augment a limited supply of data, or reduce biases in datasets, synthetic data is an efficient and cost-effective way to solve your problem. However, it’s important to remember that when it comes to using synthetic data, quality matters a lot. 

At Gretel, we're laser-focused on developing advanced data privacy techniques for generating synthetic datasets, with privacy-protection mechanisms that guard against adversarial attacks and carry mathematical guarantees. We're also making sure all of these products and features are accessible via easy-to-use APIs, so developers and data scientists can be confident they are getting high-quality synthetic data whenever and wherever they need it.

Going forward, we will continue to post research, source code, and examples about enabling data sharing and access at scale. Sign up for our free tier and give Gretel a try; you can run Gretel's APIs to Synthesize, Transform, or Classify data – no code required.

Learn more about how Gretel's Synthetic Data solutions can empower your team today.