What is Synthetic Data?
What if you could have instant access to an unlimited supply of high-fidelity data that’s statistically accurate, privacy-protected and safe to share? That’s the promise of synthetic data.
In this blog, we’ll cover what synthetic data is, how it’s made, its various types and benefits, and why developers, data scientists and enterprise teams across industries are eager to use it.
Synthetic Data: Artificial Data, Actual Events
Synthetic data is commonly used as an alternative to real-world data. More specifically, it is artificially annotated information that is generated by computer algorithms or simulations. Research has shown that synthetic data can be as good as, or even better than, real-world data for data analysis and training AI models, and that it can be engineered to reduce biases in datasets and protect the privacy of any personal data that it’s trained on. With the right tools, synthetic data is also easy to generate, so it is considered a fast, cost-effective data augmentation technique, too.
The privacy bottleneck
One of the biggest bottlenecks to innovation that developers and data scientists face today is getting access to data, or creating the data needed to test an idea or build a new product. In fact, in a recent Kaggle survey, 20,000 data scientists listed the “data gathering” stage as the single most time-consuming part of a typical project, accounting on average for 35% of the total work.
Synthetic data offers faster access to sensitive data
From our experience working at AWS, Google, OpenAI, and with other leaders in the data industry, we know firsthand that enabling developers to safely learn and experiment with data is the key to rapid innovation. Developers and data scientists don’t always need - or even want - access to sensitive or personally identifiable information (PII). That’s where synthetic data comes in.
Synthetic data makes it possible to access artificial, privacy-preserving versions of personal data in minutes, with 95% of the accuracy of the real-world data it was trained on, and without having to wait weeks for manual anonymization and approvals.
Synthetic data can augment limited datasets
Not having enough of the right data is a second bottleneck that limits the utility of data. Whether you are a developer trying to program a chatbot to respond better to a rare question, or a medical researcher collecting sensor data from medical devices, data can be your most valuable asset, yet it may be prohibitively expensive or even impossible to collect more of it.
Synthetic data offers a unique solution to this problem. Modern synthetic data uses deep learning models such as language models (LMs) and generative adversarial networks (often called GANs), where the models learn to create new, artificial datasets with the same size and distributions as the data they were trained on. Using techniques such as conditional data generation, synthetic data models can be tuned to generate records with particular attributes. For example, they can augment machine learning training sets to help algorithms better detect heart disease in female patients.
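To make the idea of conditional data generation concrete, here is a minimal sketch in Python. It is not how a language model or GAN works: as a deliberately simple stand-in, it fits one independent Gaussian per field for each value of the conditioning attribute, which ignores correlations between fields. The toy records, the field layout, and the helper names `fit_by_condition` and `sample_conditional` are all hypothetical and used only for illustration.

```python
import random
import statistics

# Toy, made-up training records: (sex, age, resting_heart_rate).
# Female patients are under-represented on purpose.
records = [
    ("F", 61, 78), ("F", 58, 82), ("F", 66, 75),
    ("M", 55, 70), ("M", 63, 74), ("M", 59, 72), ("M", 70, 69), ("M", 52, 71),
]

def fit_by_condition(rows):
    """Fit an independent Gaussian (mean, stdev) to each numeric field, per condition value."""
    grouped = {}
    for condition, *values in rows:
        grouped.setdefault(condition, []).append(values)
    model = {}
    for condition, value_rows in grouped.items():
        columns = list(zip(*value_rows))
        model[condition] = [(statistics.mean(c), statistics.stdev(c)) for c in columns]
    return model

def sample_conditional(model, condition, n, rng):
    """Draw n synthetic records that all carry the requested condition value."""
    params = model[condition]
    return [
        (condition, *(round(rng.gauss(mu, sigma)) for mu, sigma in params))
        for _ in range(n)
    ]

model = fit_by_condition(records)
rng = random.Random(0)

# Ask only for female records to rebalance the training set.
extra_female = sample_conditional(model, "F", 5, rng)
augmented = records + extra_female
```

After this runs, `augmented` holds the original records plus five synthetic female records, so both groups are equally represented. A real synthetic data model would learn the joint distribution of the fields rather than treating each one independently.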
Trends driving synthetic data use
Privacy by design, and doing more with less…
In the visual below, Gartner shared a prediction that by 2030 the use of synthetic data will overshadow real data in AI models. From what we have heard in customer conversations, this is already happening today, and it is driven by several key trends - including increased compliance costs and regulatory risks under new data protection regimes like GDPR and CCPA, record-high cybersecurity attacks and data breaches, privacy concerns related to training algorithms on customer data, high costs of manually annotating data to train machine learning algorithms, and more processing happening directly on devices (therefore requiring less data to be sent to the cloud).
Why is Synthetic Data Important?
At Gretel, we’ve seen three core use cases for synthetic data that benefit enterprises, developers, researchers, and applied scientists, too:
- Making private data accessible and shareable – allows for faster, safer collaboration with data that can accelerate time to value and unlock innovation.
- Generating more samples from limited datasets – provides a higher volume of data for applications, such as testing AI/ML models in pre-production and production-grade training.
- Reducing bias in machine learning datasets – enables the creation of more representative and inclusive datasets that can better predict trends, for example, diagnosing heart disease in women.
Synthetic Data Informs AI/ML Models
Machine learning algorithms need lots of examples to perform well, and they generalize poorly to data outside what they have been trained on. Consider what it would be like to train a voice assistant such as Alexa to recognize new commands from users. For each new utterance, there are thousands of possible variations of speech, semantics, and slang that the algorithm needs to be able to understand. Traditionally, these algorithms have been trained on large amounts of real-world data that have been manually collected and annotated.
What if new phrases and combinations could be created automatically by a developer from just a few examples—to include variations in speech, background noises such as dogs barking, or even multiple voices talking at once? The power to create nearly infinite variations of new data is one of the most powerful use cases for synthetic data and helps drive advancements in machine learning.
Diverse Use Cases Across Multiple Industries and Business Functions
Being able to learn from and build with data is one of the fastest contributors to innovation today. As such, there are applications for synthetic data across industries. Here are some common examples:
- Automotive and Robotics — leveraging synthetic data to create simulated environments for training robots, self-driving car software, and even testing safety and crash prevention technologies.
- Financial Services — creating synthetic time-series data to enable data sharing that doesn’t compromise customers’ privacy, and also to create examples of rare events and anomalies (such as the GameStop squeeze in 2021) to train algorithms to better respond to market events.
- Cybersecurity and Infosec — using synthetic data to train machine learning models to better detect rare events including fraud and cyber attacks, as well as some more advanced technology that defends synthetic data against adversarial attacks.
- Social Media — using synthetic data to train recommendation algorithms without having to use real customer data.
- Gaming and Metaverse — using synthetic data to safely capture and study new forms of user data such as biometrics, heart rates, and eye movements.
- Healthcare and Life Sciences — creating synthetic genomic data to fuel medical breakthroughs and encourage better medical care, as well as potentially create new revenue streams for hospitals and other medical institutions.
- Manufacturing — using synthetic data to simulate complex supply chain operations and predict where failures may occur.
- Retail — using synthetic data to simulate options for laying out products in a store and how customers move through aisles.
Additionally, synthetic data can be used to accelerate different business functions such as:
- Marketing — enable wider access to marketing data by creating synthetic and privacy-preserving versions of sensitive user behavioral and transactional data.
- Machine Learning — research has shown that augmenting machine learning datasets with synthetic data can lead to both better accuracy than using real-world-only data, and increased algorithmic fairness.
- Agile Development and Operations (e.g. DevOps, MLOps) — synthetic and de-identified data is being used by companies today to power pre-production and testing environments with all of the dynamism of real-world data and none of the privacy risks.
- Human Resources — enable wider access to employee data by creating synthetic and privacy-preserving versions of sensitive HR data.
How Accurate Is Synthetic Data vs. Real Data?
In contrast to terms often used to describe synthetic data such as “fake data” or “mock data”, quality synthetic data can be nearly as accurate as the real data that it is based on, and in some cases even improve on the accuracy of real-world data. This is possible because synthetic data models can generate many more samples from training data, which can help downstream machine learning algorithms generalize better. In fact, in testing Gretel’s synthetic data accuracy against the top 8 machine learning datasets on the data science platform Kaggle, synthetic data often outperformed real-world data accuracy for downstream classification tasks, with a mean accuracy within 1% of the real-world equivalents. Even with Gretel’s advanced privacy features set to “high” (which prevents overfitting, detects dangerous similarities and outliers, and adds noise to the dataset through differential privacy), the mean synthetic data accuracy was within 2% of real-world accuracy.
As mentioned above, in some cases synthetic data can even outperform its real-world counterparts. Consider the example below on the UCI Credit Card Default dataset, where models trained on synthetic data outperformed their real-world counterparts with an average accuracy of 78.39% vs 76.56% (+1.83 percentage points) in predicting whether a person would default on credit card payments, across machine learning classifiers including Random Forest, XGBoost, and SVM.
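For readers who want to see the shape of such a comparison in code, the following toy sketch trains the same classifier once on "real" data and once on synthetic data, then scores both on the same held-out real data. It is not the Kaggle or UCI experiment described above. Everything here is an assumption made for illustration: the two-class Gaussian data is made up, the synthesizer is a simple per-class Gaussian stand-in rather than a deep learning model, the classifier is a nearest-centroid rule rather than Random Forest, XGBoost, or SVM, and evaluating on held-out real data is assumed to be the comparison of interest.

```python
import random
import statistics

rng = random.Random(42)

def make_real(n_per_class):
    """Made-up 'real' data: two 2-D Gaussian classes with unit variance."""
    centers = {0: (0.0, 0.0), 1: (3.0, 3.0)}
    return [((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label)
            for label, (cx, cy) in centers.items() for _ in range(n_per_class)]

def synthesize(rows, n_per_class):
    """Stand-in synthesizer: per-class, per-field Gaussians fitted to the training rows."""
    out = []
    for label in sorted({y for _, y in rows}):
        cols = list(zip(*(x for x, y in rows if y == label)))
        params = [(statistics.mean(c), statistics.stdev(c)) for c in cols]
        out += [(tuple(rng.gauss(m, s) for m, s in params), label)
                for _ in range(n_per_class)]
    return out

def fit_centroids(rows):
    """Nearest-centroid classifier: store the mean point of each class."""
    return {label: tuple(statistics.mean(c)
                         for c in zip(*(x for x, y in rows if y == label)))
            for label in {y for _, y in rows}}

def accuracy(centroids, rows):
    """Fraction of rows whose nearest centroid matches the true label."""
    def predict(x):
        return min(centroids,
                   key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centroids[k])))
    return sum(predict(x) == y for x, y in rows) / len(rows)

real_train, real_test = make_real(200), make_real(200)
synthetic_train = synthesize(real_train, 200)

# Same model, same real test set; only the training data differs.
acc_real = accuracy(fit_centroids(real_train), real_test)
acc_synth = accuracy(fit_centroids(synthetic_train), real_test)
print(f"trained on real: {acc_real:.3f}, trained on synthetic: {acc_synth:.3f}")
```

The key design point is that both models are scored on the same real test set, so any gap between the two numbers reflects what was lost or gained by training on synthetic records.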
Determining Synthetic Data Quality
Three core indicators, when combined and weighted accordingly, can help assess the quality of a synthetic dataset and the use cases, if any, that it might be suitable for: field correlation stability, deep structure stability, and field distribution stability. All three metrics are part of Gretel’s Synthetic Quality Score (SQS). Scores range from 0 to 100 and are provided every time a user generates a synthetic model. We’ll walk through each metric using screenshots from a typical SQS report.
1. Field Correlation Stability – the correlation between every pair of fields in the real-world data vs. synthetic data. To aid in the comparison of field correlations, this metric is shown as a heatmap in the synthetic quality report showing the computed difference of correlation values.
2. Deep Structure Stability – To verify the statistical integrity of deeper, multi-field distributions and correlation, Gretel compares a Principal Component Analysis (PCA) computed first on the original data, and then on the synthetic data. Look for a similar size and distribution of elements to verify that the model has not overfitted on the training data.
3. Field Distribution Stability – a measure of how closely the field distributions in synthetic data mirror those in the original data. For each field, we compare the two distributions using a common measure, the Jensen-Shannon Distance.
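The first and third metrics can be sketched with the Python standard library alone. This is an illustration of the underlying statistics, not Gretel's implementation, and it does not reproduce how SQS weights or scales the results. The helper names, the toy columns, and the choice of a base-2 logarithm (which bounds the distance between 0 and 1) are assumptions. The sketch treats the compared field as categorical; a numeric field would need to be binned first.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_difference(real, synth):
    """Absolute difference in correlation for every pair of fields (heatmap cell values)."""
    fields = sorted(real)
    return {
        (a, b): abs(pearson(real[a], real[b]) - pearson(synth[a], synth[b]))
        for i, a in enumerate(fields) for b in fields[i + 1:]
    }

def js_distance(real_values, synth_values):
    """Jensen-Shannon distance (base 2) between two categorical fields."""
    p_counts, q_counts = Counter(real_values), Counter(synth_values)
    divergence = 0.0
    for key in set(p_counts) | set(q_counts):
        p = p_counts[key] / len(real_values)
        q = q_counts[key] / len(synth_values)
        m = (p + q) / 2
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    # The distance is the square root of the divergence; clamp tiny negative rounding errors.
    return math.sqrt(max(divergence, 0.0))

# Made-up columns, keyed by field name.
real = {"age": [23, 35, 47, 52, 61], "income": [30, 42, 55, 60, 72]}
synth = {"age": [25, 33, 45, 55, 60], "income": [32, 40, 57, 58, 70]}

corr_diff = correlation_difference(real, synth)
plan_distance = js_distance(["basic", "basic", "pro", "free"],
                            ["basic", "pro", "pro", "free"])
```

In both cases smaller is better: a correlation difference near 0 means a pair of fields moves together in the synthetic data the same way it does in the original, and a Jensen-Shannon distance near 0 means the two field distributions are nearly identical.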
Understanding a synthetic data model’s quality is about determining its accuracy and the level of privacy it provides. This is vital information when the synthetic data will be used to answer critical questions such as “Is this patient at risk for a stroke?”
Gretel’s Synthetic Data Quality Report is a digestible report that provides you with an overall quality score (from 0-100) and a summary table of the viable use cases you could use the data for, as well as a more detailed analysis of how your synthetic data generation process went.
For a deeper dive on quantifying the accuracy of synthetic data for downstream tasks such as business analytics, forecasting, and training machine learning models, check out Gretel’s case study with one of the largest financial institutions in the world, where they successfully created a process to enable internal data sharing of synthetic data to replace a legacy and error-prone process of manually de-identifying and redacting datasets.
The Benefits of Synthetic Data
Here are some of the top benefits of synthetic data:
- Synthetic data helps businesses streamline access to data for users up to 10x faster than traditional approaches, and with reduced privacy risks.
- Synthetic data can be used to simulate not yet encountered conditions.
- Synthetic data offers protections against privacy attacks including re-identification and data linkage attacks, as none of the records generated by the synthetic model are based on any single individual.
- When combined with formal privacy techniques such as differential privacy, synthetic data can offer mathematical privacy guarantees to enable sharing of sensitive data.
- Synthetic data can be generated at a fraction of the cost of current approaches for annotating datasets to train machine learning algorithms.
- Advanced synthetic data models can preserve the multivariate relationships between variables better than statistical approaches.
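To give a feel for what a mathematical privacy guarantee means, here is a minimal sketch of the textbook Laplace mechanism applied to a simple counting query. This is an assumption-laden illustration of differential privacy in general, not the way a synthetic data model applies it during training, and the function names and toy data are hypothetical. A counting query changes by at most 1 when one person is added or removed, so adding Laplace noise with scale 1/epsilon gives an epsilon-differentially-private answer.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def private_count(values, predicate, epsilon, rng):
    """Epsilon-DP count: sensitivity is 1, so the noise scale is 1 / epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Made-up ages; the query asks how many people are 65 or older.
ages = [34, 71, 68, 45, 80, 59, 66, 23]
rng = random.Random(1)
noisy = private_count(ages, lambda age: age >= 65, epsilon=0.5, rng=rng)
```

A smaller `epsilon` means more noise and stronger privacy, which is the trade-off between privacy and accuracy that any differentially private approach has to manage.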
Synthetic Data Challenges
Some of the challenges with synthetic data include:
- Highly dimensional datasets with hundreds or thousands of columns, such as medical genomics data or complex IoT sensor data, can be compute-intensive to synthesize.
- Synthesizing relational datasets can require some manual configuration.
- Some privacy-preserving technologies, such as differential privacy, require large amounts of data to provide strong privacy guarantees without degrading accuracy, and thus may not be appropriate for all datasets.
- Synthetic data generation requires time and effort.
Types of Synthetic Data
The main categories of text-based synthetic data include natural language, tabular, and time-series data. Additional categories include image synthetics, audio, video, and simulation.
Time series data is often tabular data, such as sensor data or financial data, that has a time component that must be learned and maintained across multiple rows. Consider a heart rate sensor on a fitness watch, where each row might be a point in time, and a user’s heart rate should increase gradually as they increase running speed or start going uphill. To reproduce trends and seasonality, the synthetic data model must therefore maintain a level of state or memory as it generates each row. Synthetic data models such as GANs have proven particularly effective at recreating these kinds of time-series components.
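The need for state can be illustrated with a small hand-written generator. This is not a GAN or any learned model: it is a simple first-order smoothing rule, assumed here only to show why each row has to depend on the previous one. The function name and the parameters `resting`, `gain`, `inertia`, and `noise` are hypothetical.

```python
import random

def synth_heart_rate(speeds_kmh, rng, resting=60.0, gain=8.0, inertia=0.8, noise=1.0):
    """Generate one heart-rate reading per speed sample, carrying state between rows."""
    hr = resting                      # the state carried from one row to the next
    series = []
    for speed in speeds_kmh:
        target = resting + gain * speed
        # Move only part of the way toward the target, so changes are gradual.
        hr = inertia * hr + (1.0 - inertia) * target + rng.gauss(0.0, noise)
        series.append(round(hr, 1))
    return series

# Five rows at rest, twenty rows running at 10 km/h, then ten rows of recovery.
speeds = [0] * 5 + [10] * 20 + [0] * 10
series = synth_heart_rate(speeds, random.Random(7))
```

If each row were sampled independently instead, the heart rate would jump straight to its running level the moment the speed changed, which is exactly the kind of unrealistic artifact a stateful model avoids.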
Text or natural language data is another popular target for synthetic data. Being able to generate convincing natural language from models trained on just a few examples is valuable for developers and data scientists training chatbots, voice assistants, and a multitude of other applications. Transformer-based language models, which add an attention component, have been developed specifically for this kind of task. GPT-3, for example, has 175 billion parameters, was trained on 45 TB of text, and can generate convincing natural language given only a few prompts.
Images and Video
Generating realistic images, including human faces, has been a popular research task over the past years and has led to tremendous advances in GANs. Examples such as Nvidia’s StyleGAN, as well as CycleGAN and now FastCUT, have shown that it is possible to synthesize high-quality images based on only labels and publicly available image datasets. For more information, read our blog on synthesizing user location data on public maps using FastCUT.
Synthetic data is one of the hottest use cases in the simulation space, where 3D simulations such as computer game engines (see Unity3D’s Perception) can be used to train anything from robots to self-driving cars and other autonomous systems to navigate real-world situations at a scale and speed that is not possible with real-world testing. For additional examples, check out Facebook’s AI Habitat and OpenAI’s Gym.
Technologies for audio and speech synthesis are used heavily in music, and in the voice assistants that many of us use every day such as Siri, Alexa, Google Assistant, etc. The need to train automated speech recognition (ASR) algorithms to recognize different kinds of voices, linguistics, and speech has also created the need to synthesize voices and variations of terms, as collecting data from the real world or having speech generated and annotated manually is prohibitively costly.
Gretel’s Advanced Synthetic Data Privacy Guarantees
In conclusion, whether you need to access or share private data, augment a limited supply of data, or reduce biases in datasets, synthetic data is an efficient and cost-effective way to solve your problem. However, it’s important to remember that when it comes to using synthetic data, quality matters a lot.
At Gretel, we’re laser-focused on developing advanced data privacy techniques for generating synthetic datasets with privacy-protection mechanisms that immunize them from adversarial attacks. We’re also making sure all of these products and features are accessible via easy-to-use APIs, so developers and data scientists can be confident they are getting high-quality synthetic data whenever and wherever they need it.
Going forward, we will continue to post research, source code, and examples about enabling data sharing and access at scale. Sign in to our free tier and give Gretel a try. You can run Gretel’s APIs to Synthesize, Transform, or Classify data – no code required.