How to Generate Synthetic Data: Tools and Techniques to Create Interchangeable Datasets
The Concept of Synthetic Data Is Simple…
Synthetic data is artificially annotated information that is generated by computer algorithms or simulations. Often, synthetic data is used as a substitute when suitable real-world data is not available – for instance, to augment a limited machine learning dataset with additional examples. In other cases where real-world data cannot be used due to privacy concerns or compliance risks, synthetic versions of that data can be safely shared and mined for insights, without revealing any sensitive or personally identifiable information (PII). This “shareable” quality can enable, for example, medical research between hospitals where a researcher can learn about a disease, but not the real patients that the data was based on.
Getting access to data is one of the bottlenecks to progress and innovation today. In fact, in a recent Kaggle survey, 20,000 data scientists listed the “data gathering” stage as the most time-consuming part of a typical project, accounting for 35% of the total work. In certain fields such as life sciences research, approvals for access to genomic datasets for researching cancer or rare diseases can take 6 months or more.
Synthetic data opens up the possibility of accessing artificial, privacy-preserving versions of data in minutes, or of augmenting a machine learning dataset to achieve better accuracy and fairness than training on real-world data alone.
…But the Possibilities are Endless
Here are some examples of how synthetic data is utilized for different purposes today.
- Implementing Privacy-by-Design Principles - Privacy protection is one of the most compelling use cases for synthetic data. Recent advances in machine learning have made it possible for deep learning models such as language models (LMs) and generative adversarial networks (GANs) to learn and recreate correlations and insights in datasets with high accuracy. By incorporating privacy-enhancing technologies such as differential privacy and advanced data privacy filtering techniques, it’s possible to provide mathematical guarantees that a machine learning model will not memorize any individual user’s personal data or secrets that exist in the dataset. With these levels of protection in place, you can bypass time-consuming and complex manual data anonymization tasks, and enable fast and safe sharing of information between teams, organizations, and even companies. For example, a prominent global financial institution created a process for internal teams to generate highly accurate and private synthetic data to share within their business using Gretel.ai.
- Creating Safe Data Retention Policies - One challenge that any company focused on regulatory compliance (such as GDPR, CCPA, SOC, ISO, PCI, FedRamp, and others) must consider is data retention. When setting a data retention policy, the company makes a commitment to its customers to delete data after a period to comply with government and industry regulations. Without data retention policies, companies might store too much information, leading to operational inefficiencies, increased costs, and legal and security risks. Synthetic data provides an interesting alternative to simply deleting data. In this case, companies can synthesize real-world datasets as they reach the end of their retention period, creating new synthetic datasets that are no longer based on any individual or entity. The company loses the ability to query raw data and customer events, but this is often no longer necessary or relevant after years have passed. It retains the ability to query the synthetic data for graphs and trend analyses, in place of the real-world data that can now be deleted.
- Testing Software Products and Services - Often, developers are not able to use real production data for their development, testing, and staging environments due to privacy and security concerns. Because of this, testing environments are often limited to using snapshots of production data that are manually de-identified by developers, or low-quality fake or mock data. Snapshotting and manually anonymizing data for pre-production testing has two disadvantages: the snapshot quickly grows stale and stops mimicking the changes happening in the production environment, and the manual, error-prone process often misses PII and other sensitive data. Simply creating fake or mock data is an alternative that works for simple use cases, like showing example data for a user interface table, but it lacks the realism and correlations of real data and cannot replicate the kinds of events happening in the production system. With synthetic data, it is possible to generate data that looks, acts, and feels just like your production data to enable rapid and realistic testing and feature development.
- Training ML and AI Models - Synthetic data offers the promise of better data for increasingly data-hungry machine learning models. Today, these models increasingly control critical decisions in our lives, such as whether we get hired for a job, approved for a loan, or diagnosed with different conditions or diseases, and even how we communicate with the devices and world around us. While incredibly powerful, machine learning models struggle to interpret data that they have not seen or have only rarely seen before. For developers and companies building machine learning algorithms, additional data (such as medical test results, examples of cyber attacks, or image and video data to train robotics) can be prohibitively expensive or even impossible to collect. Synthetic data provides a scalable and cost-effective alternative to working with real customer data, or manually generating and annotating additional data.
- Sharing Data Within Organizations - There is a business trend towards smaller development teams (such as the two-pizza team coined by Amazon) that are made up of fewer developers but are empowered with increased ownership of their products and features. While this model makes development fast and agile, it often results in silos being created around the data that individual teams have access to, for example in a database or warehouse hosted by each team. When another team or business unit in the company needs access to information from siloed data, it can take weeks or months for this to become possible. This problem also exists, and is perhaps amplified, in businesses using data lakes, lake houses, and data warehouses to centralize data. While the data itself can be centralized, a lack of understanding about where sensitive data exists in each table or data lake often results in companies needing to treat entire datasets as containing customer data, and thus only enabling access for a small percentage of employees and business units. With synthetic data, artificial and fully anonymized datasets can be created in a warehouse or data lake and freely queried by users across the business without going through lengthy approval processes. For users that need raw access to real-world data, access can be granted on a case-by-case basis, so it is no longer a blocker for innovation and progress.
- Sharing Data With Third Parties - Often, companies need to share sensitive data outside of their organization. Modern attacks on data such as data linkage, re-identification, and joinability attacks have shown that traditional anonymization and de-identification techniques are not sufficient to protect the privacy of individuals and other sensitive data inside these datasets. For a practical example, consider the Netflix Prize Challenge - a competition hosted by Netflix with a $1M prize to the team that could best Netflix’s own internal algorithms on movie recommendations. Before hosting the challenge, the Netflix team manually anonymized 100M movie reviews - leaving only the following fields: movie_id, user_id, date, and review ratings. However, researchers from the University of Texas at Austin demonstrated how an adversary with only basic knowledge of a subscriber could identify their records, even “uncovering their apparent political preferences and other potentially sensitive information”. Using techniques such as synthetic data, it is now possible to create a similar dataset of 100M reviews by training a model on the dataset using differential privacy, and then using it to generate a synthetic dataset. This synthetic dataset would contain many of the insights of the real-world data but would offer increased, and even mathematically guaranteed, levels of privacy, with only a small loss in overall accuracy.
- Creating Synthetic Marketplaces & Accessible Data Exchanges - Today, companies are also looking to create new monetization streams with marketplaces based on synthetic data. This is of particular interest to medical researchers and organizations in the healthcare and life sciences sectors. For example, several departments at the University of Arkansas for Medical Sciences (UAMS) recently partnered to create a custom synthetic data electronic medical record software platform that is a privacy-protected replica of actual patient data. This gives researchers and medical students safe access to medical providers across the network as well as transferable data that’s based on more than 65,000 patients across the state. As an educational tool, it gives students invaluable hands-on experience while protecting the privacy and welfare of real patients.
- ‘Scaling Out’ Developments in the Cloud - Many regulated industries are making the transition from on-premises and owned clouds, to hybrid and multi-cloud deployments. Often, while many workloads can be directly transitioned to a cloud environment such as AWS, GCP, or Azure, certain workloads with regulated data, such as customer financial transactions, need to stay inside the customer’s approved internal environment. While this meets strict compliance and security requirements, it does not allow these workloads to benefit from the “scale-out” and elasticity benefits of the cloud - such as autoscaling, or access to expensive GPUs for machine learning training and inference. To get the best of both worlds, companies in regulated industries can create fully synthetic datasets with strong privacy guarantees that can be used by developers and data scientists for rapid experimentation and testing in the cloud. For example, imagine running a parameter sweep of 1,000+ experiments with different machine learning parameters to find an optimal configuration. This approach takes advantage of the scalability and on-demand nature of the cloud to quickly run the experiment, and then data scientists can train their production model on real-world data in their own cloud using the optimal parameters. This approach is often referred to as “scale-out development”.
There are also dozens of industry-specific use cases for synthetic data. Here are some of the most common.
- Financial Services - Financial services is one of the industries going through the most change, as businesses are increasingly able to learn from and leverage their data to create better experiences for their customers. Financial data is often very sensitive and is regulated by compliance protocols such as GDPR, CCPA, PCI, ISO, and SOC certifications. Financial institutions seek ways to safely enable internal access to data, and even to share data with partners. For example, consider the Apple and Goldman Sachs partnership around Apple Card. Synthetic data offers a unique ability to share datasets that look and feel just like real-world data, and can be queried like real-world data, but that are not based on any individual’s data, enabling sharing without concerns about how customers’ data may be used by another organization.
- Manufacturing - Manufacturing has seen a tremendous change in the past few years, as the pandemic and supply chain shortages have fueled the need for smarter, more resilient supply chains. One of the top use cases that we see from manufacturing companies is creating realistic, artificial consumer packaged goods (CPG) datasets that can be used to generate and test scenarios for supply chains. A second area where we see synthetic data being used in manufacturing is training automation (such as robotics) in simulated environments, so that it responds better to varying conditions such as lighting and surroundings.
- Healthcare and Life Sciences - In recent years, data-driven medicine has begun to revolutionize healthcare and life sciences. In our work with different life sciences organizations, we often hear that the biggest challenge (and opportunity) for the life sciences community is learning how to safely share patient health data while protecting the privacy of patients. For a real-life example, read Gretel’s joint case study with Illumina’s Emerging Solutions team, where we use state-of-the-art generative neural networks to recreate artificial versions of the highly complex genomic sequences used by life sciences researchers. These synthetic datasets are based on real-world data, and offer enhanced privacy guarantees that can enable life science researchers to better collaborate and quickly test ideas through open access to data, without compromising patient privacy.
- Automotive and Robotics - leveraging synthetic data to create simulated environments for training robots, self-driving car software, and even testing safety and crash prevention technologies.
Techniques to Generate Synthetic Data
Now, we’ll take a look at three common techniques for generating synthetic data, starting with the most basic approach.
Drawing Numbers From a Distribution
In contrast to more advanced, machine learning-based approaches, a popular technique for generating synthetic data can be to simply draw, or sample numbers from a distribution. While this approach does not capture the insights of real-world data, it can create a distribution of data that follows a curve that is loosely based on real-world data.
For this example, we will use Python and the NumPy library’s numpy.random.randn() function to create a set of four datasets, each drawn from a “normal” distribution with a slightly different centerpoint.
Here's the code if you'd like to try it yourself.
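A minimal sketch of this approach is shown here. The centerpoints (0, 0.5, 1, and 1.5), the sample count, and the seed are assumptions chosen for illustration and do not come from the original example.

```python
import numpy as np

# Fix the seed so repeated runs draw the same numbers (assumed value).
np.random.seed(42)

# Hypothetical centerpoints for the four datasets.
centers = [0.0, 0.5, 1.0, 1.5]
n_samples = 1_000

# randn() draws from a standard normal distribution (mean 0, std 1);
# adding a constant shifts the centerpoint of each dataset.
datasets = {center: np.random.randn(n_samples) + center for center in centers}

for center, values in datasets.items():
    print(f"center={center:.1f}  mean={values.mean():.3f}  std={values.std():.3f}")
```

Each dataset follows the same bell curve, shifted along the x-axis. Plotting the four arrays as histograms would show the overlapping curves.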
Agent-Based Modeling
Agent-based modeling (ABM) is a simulation technique where individual agents are created that interact with each other. These techniques are particularly useful to examine interactions between agents (e.g., people, cells, or even computer programs) in a complex system. Python packages such as Mesa make it easy to quickly create agent-based models using built-in core components, and to visualize them in a browser-based interface.
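To make the idea concrete, here is a small agent-based simulation written with only the Python standard library rather than Mesa. The model itself (agents that hand one unit of wealth to a random peer) and every name in it are illustrative assumptions.

```python
import random


class Agent:
    """A hypothetical agent that holds some wealth."""

    def __init__(self, agent_id, wealth=1):
        self.agent_id = agent_id
        self.wealth = wealth

    def step(self, others, rng):
        # An agent with money gives one unit to a randomly chosen other agent.
        if self.wealth > 0:
            other = rng.choice(others)
            other.wealth += 1
            self.wealth -= 1


def run_simulation(n_agents=50, n_steps=100, seed=0):
    rng = random.Random(seed)
    agents = [Agent(i) for i in range(n_agents)]
    for _ in range(n_steps):
        # Activate the agents in a fresh random order on every step.
        for agent in rng.sample(agents, len(agents)):
            others = [a for a in agents if a is not agent]
            agent.step(others, rng)
    return agents


agents = run_simulation()
wealth = sorted(a.wealth for a in agents)
print("total wealth:", sum(wealth))
print("poorest five:", wealth[:5])
print("richest five:", wealth[-5:])
```

The per-agent wealth values at the end of the run form a synthetic dataset that emerges from the interaction rules instead of being sampled from a fixed curve.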
Generative Models
Generative modeling is one of the most advanced techniques used to create synthetic data. It can be described as an unsupervised learning task that involves automatically discovering and learning the insights and patterns in data in such a way that the model can be used to output new examples that match the same distribution as the real-world data it was trained on.
Training generative models often starts with gathering a large amount of data in a particular domain (e.g., images, natural language text, tabular data), and then training a model to generate more data like it. The generative models described below have different architectures, but are all based on neural networks - and fundamentally leverage the same approach of utilizing a number of parameters smaller than the input data they were trained on - which essentially forces the models to discover patterns and insights in the data in order to generate additional datasets.
Two Common Approaches to Generative Models
Generative Adversarial Networks (GANs) - treat the training process as a game between two separate networks - a generator network, and a second discriminative network that attempts to classify samples as either coming from the real world (training distribution), or coming from the model (synthetic data). On each training iteration, the generator adjusts its model parameters to generate more convincing examples to (effectively) fool the discriminator, until, ideally, the discriminator is no longer able to differentiate between the real world and synthetic examples.
Language Models (LMs) - such as Recurrent Neural Networks (RNNs) and Transformers attempt to learn the underlying probability distribution of the training data, such as a sequence of words or tokens, so that they can easily sample new data from that learned distribution (or effectively, predict the next token or words in a sentence). While simple in concept, language models are arguably the most powerful algorithms for generating text today. They demonstrate the ability to learn and recreate both short texts (sentences, tweets, etc.) and longer texts (entire chapters of books, or complex time-series sequences of data) by training on massive amounts of data.
The image above is from “Attention is All You Need” by Ashish Vaswani et al., Google Brain. It shows the attention mechanism in a transformer model following long-distance dependencies in the encoder. As you can see, many of the attention heads are going from the word “making” to “making more difficult”. Popular language models include recurrent neural networks such as LSTM or GRU networks, and Transformers (such as OpenAI’s GPT-2 and GPT-3), whose attention mechanism can relate elements across an entire sequence at once. Transformers can also achieve faster training times by avoiding recurrence and processing sentences as a whole via attention mechanisms and positional embeddings.
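The core idea of a language model, learning a next-token distribution and then sampling from it, can be sketched with something far simpler than an RNN or Transformer. The example below is a character-level bigram counter, not a neural network, and the training names are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def train_bigram_model(texts):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for text in texts:
        padded = "^" + text + "$"  # ^ marks the start, $ marks the end
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts


def sample(counts, rng, max_len=30):
    """Generate one string by repeatedly sampling the next character."""
    out, prev = [], "^"
    for _ in range(max_len):
        chars, weights = zip(*counts[prev].items())
        nxt = rng.choices(chars, weights=weights, k=1)[0]
        if nxt == "$":
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)


# Hypothetical training data.
training_texts = ["anna", "hannah", "nathan", "susan", "joanna", "ethan"]
model = train_bigram_model(training_texts)

rng = random.Random(7)
samples = [sample(model, rng) for _ in range(5)]
print(samples)
```

Neural language models replace the count table with learned parameters and condition on much longer contexts, but the generation loop (predict a distribution, sample, append, repeat) has the same shape.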
Are Language Models Better than GANs for Synthetic Data?
The answer is not so simple. Language models and GANs each have pros and cons, making them perform well on different tasks. As such, at Gretel, we like to think of algorithms as being part of a toolkit that can be applied in different ways for different problems. For example, we have found GANs to be very effective and efficient for generating synthetic image data and learning complex relationships in time-series data, but they can be difficult to optimize due to unstable training dynamics and are generally outperformed on text generation tasks by language models. When working with tabular data, language models are able to quickly learn and generalize datasets, but at higher computational costs than GAN equivalents.
Regardless of whether the next generation of generative models are powered by GANs, LMs, or even something new (check out generative Diffusion Models), we are incredibly excited about the capabilities that generative models make possible with synthetic data.
Advanced Features with Generative Models
Not all models are created equal, and different architectures offer benefits in certain areas. Here we discuss two features that are particularly useful with synthetic data.
Conditional Data Generation - Some generative models support model conditioning (sometimes referred to as “seeding”, or “prompting”), which is a technique that enables the model to generate more records that match a certain class or label, versus simply recreating the distribution that it was trained on. This technique can be used to balance class distributions in datasets for more accurate or ethically fair machine learning. For examples, check out our blog on reducing biases in the US Adult Income prediction dataset, and improving the accuracy of UCI’s popular heart disease detection dataset by augmenting it with additional records for female patients.
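As a rough illustration of conditioning, the sketch below uses a deliberately tiny stand-in for a generative model: one Gaussian per class label. The dataset, the labels, and the `generate` helper are all hypothetical. A real synthetic data model would learn far richer structure, but the request "give me more records of class b" works the same way.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced training set with one numeric feature:
# 900 records of class "a" and only 100 of class "b".
real = {
    "a": rng.normal(loc=50.0, scale=5.0, size=900),
    "b": rng.normal(loc=70.0, scale=8.0, size=100),
}

# "Train" the toy conditional model: one mean and std per class label.
model = {label: (values.mean(), values.std()) for label, values in real.items()}


def generate(label, n):
    """Sample n synthetic values conditioned on the requested class label."""
    mean, std = model[label]
    return rng.normal(loc=mean, scale=std, size=n)


# Condition on the minority class to top it up to the majority class size.
n_needed = len(real["a"]) - len(real["b"])
synthetic_b = generate("b", n_needed)
balanced_b = np.concatenate([real["b"], synthetic_b])

print("class a:", len(real["a"]), "class b after augmentation:", len(balanced_b))
```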
Synthetic Data Privacy - We know that when training a machine learning model, we are attempting to teach the model to learn from real-world data. As such, it’s possible for the model to memorize and replay sensitive data from the training set - such as a name, credit card number, or sensitive medical information about a patient. For many uses of synthetic data, privacy is a core use case - so being able to apply techniques such as differential privacy during the training process can provide mathematical guarantees that the model will not memorize or be able to replay sensitive data that it was trained on. Furthermore, many users seek to find a balance between privacy guarantees and the accuracy of their dataset for a particular task. For more information about privacy and synthetic data, check out Gretel’s deep dive on privacy with synthetic data, and also our comparison of the accuracy of private synthetic data vs real-world data for 8 top datasets on Kaggle for machine learning tasks. Spoiler alert: Gretel.ai’s synthetic datasets average within 2.58% of the accuracy of real-world data across the top 8 datasets on Kaggle.
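One widely used way to apply differential privacy during training is differentially private stochastic gradient descent (DP-SGD). The text above does not say which mechanism Gretel uses, so treat this as an illustration of the general idea only. The sketch shows the core step: clip each record's gradient so no single record can dominate, then add Gaussian noise. The function name, the clipping norm, and the noise multiplier are assumptions, and the code does not compute the resulting privacy budget.

```python
import numpy as np


def dp_average_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, and average."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Scale down any gradient whose L2 norm exceeds clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    # Noise is calibrated to the clipping norm, the most one record can contribute.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_example_grads)


rng = np.random.default_rng(0)

# Hypothetical batch of 32 per-example gradients for a 4-parameter model.
grads = rng.normal(size=(32, 4))
grads[0] *= 100.0  # one outlier record with an unusually large gradient

noisy = dp_average_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
print("plain average:", grads.mean(axis=0))
print("clipped, noised average:", noisy)
```

Clipping bounds the outlier's influence on the update, and the noise masks whatever influence remains. The cost is some accuracy, which is the privacy versus accuracy balance described above.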
Synthetic Data Generation Tips and Best Practices
Here are some tips and best practices for generating clean and accurate synthetic data.
Increase your training data
The number of training records used can directly impact the quality of the synthetic data created. The more examples available when training a model, the easier it is for the model to accurately learn the distributions and correlations in the data. Always strive to have a minimum of 3,000 training examples. If possible, increasing that to 5,000 or even 50,000 is even better.
Increase your synthetic data
The more synthetic records generated, the easier it is to deduce whether the statistical integrity of the data remains intact. If your synthetic data quality score isn't as high as you'd like it to be, and you’ve generated fewer than 5,000 synthetic records, generating more synthetic records is a good way to deduce if there really is a quality issue.
Clean your data first
As in any statistical or machine learning analysis, the first step is to clean up your data. Tending to the following issues can be vital in creating quality synthetic data:
Handle missing values
Assess the extent of your missing values. A moderate amount of missing data can be easily handled by the synthetic model. An excessive amount can lead to difficulties in accurately learning the statistical structure of the data. Decide if it's appropriate to drop columns or rows with missing data, or if it's more appropriate to attempt to fill in the missing fields using, for example, the median or techniques such as KNN imputation.
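A short pandas sketch of this assessment follows. The example data and the 50% threshold for dropping a column are assumptions. For KNN imputation, scikit-learn's `KNNImputer` is one option, not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical training data with gaps.
df = pd.DataFrame({
    "age": [34, np.nan, 29, 41, np.nan, 52],
    "income": [52000, 61000, np.nan, 58000, 49000, 75000],
    "notes": [np.nan, np.nan, np.nan, "vip", np.nan, np.nan],
})

# Fraction of missing values per column.
missing_share = df.isna().mean()
print(missing_share)

# Drop columns that are mostly empty (the 0.5 cutoff is an assumption).
sparse_cols = missing_share[missing_share > 0.5].index
cleaned = df.drop(columns=sparse_cols)

# Fill the remaining numeric gaps with each column's median.
numeric_cols = cleaned.select_dtypes(include="number").columns
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())
print(cleaned)
```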
Remove redundant fields
Study the correlation matrix in Gretel's synthetic performance report, and remove unnecessary, highly correlated fields. This is particularly important when the dataset contains a large number of columns.
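If you want to run a similar check yourself before training, a correlation matrix can be computed directly with pandas. This is a stand-in for the report mentioned above, not a reproduction of it, and the columns and the 0.95 cutoff are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
height_cm = rng.normal(170, 10, size=500)

# Hypothetical dataset where one column is a rescaling of another.
df = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,
    "weight_kg": rng.normal(70, 12, size=500),
})

corr = df.corr().abs()

# Keep only the upper triangle so each pair of columns is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # assumed cutoff for "highly correlated"
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)

print("dropping:", to_drop)
print("remaining:", list(reduced.columns))
```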
Consider removing highly unique fields
A large number of highly unique fields, or just one highly unique field that is exceptionally long, such as an ID, can cause the model to struggle to learn the patterns in the data. If possible, consider removing the field or fields before generating synthetic data, and adding them back in afterwards.
Remove duplicate records
If training records are duplicated in error, then any statistical analysis of the data will be impacted. A large number of duplicated records can also cause the model to treat them as a pattern it needs to learn, resulting in the potential duplication of private information in the generated synthetic data.
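Both of the checks above, duplicate records and highly unique fields, are quick to run in pandas. The data and the 0.99 uniqueness cutoff below are assumptions. On a dataset this small, ordinary columns can look unique by chance, so the ratio is more meaningful on realistic row counts.

```python
import pandas as pd

# Hypothetical training data with one duplicated row and an ID column.
df = pd.DataFrame({
    "record_id": ["a1f9", "b7c2", "c3d8", "c3d8", "e5a1"],
    "plan": ["basic", "pro", "basic", "basic", "pro"],
    "seats": [1, 5, 2, 2, 5],
})

# Remove exact duplicate rows.
deduped = df.drop_duplicates().reset_index(drop=True)

# Share of distinct values per column; 1.0 means every value is unique.
uniqueness = deduped.nunique() / len(deduped)
id_like = uniqueness[uniqueness >= 0.99].index.tolist()

# Set the ID-like columns aside so they can be added back after generation.
held_out = deduped[id_like]
training = deduped.drop(columns=id_like)

print("rows removed:", len(df) - len(deduped))
print("ID-like columns:", id_like)
```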
Deal with anomalies
Assess whether there are anomalies in the data, and if so, whether they are errors or true data points. In general, the synthetic model will usually be robust to anomalies, but on the occasion that it's not, the replay of an anomaly in the synthetic data can potentially constitute a serious breach in privacy.
Simplify your data where possible
If a long categorical field can be replaced with a simpler set of integer labels, then do so. If a long numeric field can be grouped into a smaller number of discrete bins, then do so. If a floating point value has excessive precision, round it. This step is rarely needed, but if you find the model is struggling, it may help improve performance.
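Each of these three simplifications is a one-liner in pandas. The column names, bin edges, and rounding precision in this sketch are assumptions.

```python
import pandas as pd

# Hypothetical data with a long category, a wide numeric range, and noisy floats.
df = pd.DataFrame({
    "department": ["Research and Development", "Customer Success", "Research and Development"],
    "salary": [48250.0, 91300.0, 132875.0],
    "score": [0.123456789, 0.987654321, 0.555555555],
})

# Replace long category strings with integer labels; `labels` keeps the
# original values so the mapping can be reversed after generation.
codes, labels = pd.factorize(df["department"])
df["department"] = codes

# Group a numeric field into a small number of discrete bins.
df["salary_band"] = pd.cut(
    df["salary"],
    bins=[0, 50_000, 100_000, 200_000],
    labels=["low", "mid", "high"],
)

# Trim excess floating point precision.
df["score"] = df["score"].round(3)

print(df)
print("label mapping:", dict(enumerate(labels)))
```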
Working with highly-dimensional datasets
Datasets with high counts of columns (50+) and rows (100k+) can be tricky for synthetic data models, as dimensionality can increase training time and require tuning of the neural network, such as optimizing the learning rate and batch size. Working with a highly dimensional tabular dataset? Check out Gretel.ai’s trainer notebook, which enables parallel training, retries, and support for datasets with up to thousands of columns and millions of rows.
Sanity check CSV format and column data types
Test reading your training CSV file into a Pandas DataFrame. Sanity check that the columns are as expected and that there are no warnings about columns with multiple data types.
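A sketch of such a check is shown below. It uses an in-memory CSV so that it runs as-is; in practice you would pass the path to your own file. The column names and the stray "unknown" value are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for a training file; replace with a real path such as "train.csv".
csv_text = (
    "id,amount,signup_date\n"
    "1,10.5,2021-01-03\n"
    "2,unknown,2021-02-11\n"
    "3,7.25,2021-03-20\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Confirm the columns and inferred types are what you expect.
print(df.dtypes)

# A non-numeric column in which only some values parse as numbers
# usually means numbers are mixed with stray strings.
suspect = []
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        continue
    share_numeric = pd.to_numeric(df[col], errors="coerce").notna().mean()
    if 0 < share_numeric < 1:
        suspect.append(col)

print("columns with mixed content:", suspect)
```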
Handling tough fields
- Exceptionally long fields - Fields with an average length of more than 30 characters can cause the model to struggle. Consider whether the field is really necessary. If so, consider anonymizing that field separately with Gretel's blueprint that automatically finds PII in text and uses atomic transformations to anonymize. The remaining fields can be generated with a synthetic model, and the troublesome field added back in afterward.
- Connected fields - When one field is a derivation of other fields, the integrity of that relationship may not be maintained in the synthetic data. One example is if a field is the sum of two or more other fields. Another is if one date is always a certain number of days away from another date. If maintaining the relationship is important, remove the derived field from the training dataset, and recompute it after synthetic data generation.
- Fields with highly critical distributions - When the distribution of one field is vital to the dataset, such as a date in a time series, and exactness of the distribution is not maintained in the synthetic data, consider using that field as a "seed" for generating the remaining fields. Refer to our blog on time series for an example of using fields in this way.
- Highly unbalanced fields - When a field is highly unbalanced and you wish to remedy that, refer to Gretel's blueprint for automatically balancing your data. If the field in question is also the target of a machine learning task, refer to our blueprint for boosting a massively imbalanced dataset that uses a SMOTE-like technique to steer the synthetic model to offset the imbalance.
- Highly structured fields - The synthetic model actually thrives on structure; however, if you have a field with an excessively long, complicated structure, consider dividing that field into multiple easier fields.
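The advice for connected fields in the list above can be sketched in a few lines of pandas. The column names are hypothetical, and the `synthetic` frame is a hand-written placeholder for whatever a trained model would return.

```python
import pandas as pd

# Hypothetical real data in which "total" is derived from two other fields.
real = pd.DataFrame({
    "subtotal": [20.0, 35.5, 12.0],
    "tax": [1.6, 2.84, 0.96],
    "total": [21.6, 38.34, 12.96],  # always subtotal + tax
})

# 1. Remove the derived field before training.
training = real.drop(columns=["total"])

# 2. Placeholder for the synthetic model's output: records with the same
#    columns as `training`, but in new combinations.
synthetic = pd.DataFrame({
    "subtotal": [35.5, 12.0, 20.0],
    "tax": [0.96, 1.6, 2.84],
})

# 3. Recompute the derived field so the relationship holds exactly.
synthetic["total"] = synthetic["subtotal"] + synthetic["tax"]

print(synthetic)
```

The same pattern applies to date offsets: drop the dependent date, generate the rest, then recompute it from the base date.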
Generate Robust and Highly Accurate Synthetic Datasets With Gretel
At Gretel, we’re laser-focused on developing state-of-the-art APIs and algorithms for generating synthetic datasets, including the privacy-protection mechanisms necessary to enable the creation of guaranteed safe data. We’re also making sure all of these products and features are accessible via easy-to-use APIs, so developers and data scientists can be confident they are getting high-quality synthetic data whenever and wherever they need it.
Going forward, we will continue to post research, source code, and examples about enabling data sharing and access at scale. Sign up for our free tier and give Gretel a try. You can run Gretel’s APIs to Synthesize, Transform, or Classify data – no code required.