Machine Learning Accuracy Using Synthetic Data
Can synthetic data really be used in machine learning? We explore the utility of synthetic data created from popular datasets and tested on popular ML algorithms.
One of the questions we often hear from developers working with synthetic data is “how well does synthetic data work for ML use cases versus the original dataset?” The question is whether synthetic data can really be used to produce a machine learning model on par with the one that would have been created had the original data been used. Let’s dive in and find out.
To narrow down the test datasets, we queried the data section of Kaggle with the “Classification” filter turned on, searching for datasets with significant activity. Our criteria were that each dataset had to have at least 4000 records and be text-based (versus image or audio). Below is a table listing the datasets we used, as well as each dataset's row and column count. When training the synthetic model, all rows and columns were used with no modifications. All training parameters used the default settings for gretel-synthetics, and we generated 5000 synthetic records for each dataset.
All original datasets, the generated synthetic records, the code used to create synthetic records and all code referenced in this blog can be found: (right now datasets exist on our public Google Drive folder called Datasets in a subdirectory called ML_Accuracy_Blog_Datasets.)
We start by building the list of machine learning algorithms we’ll be testing, as shown below. The complete code used in these experiments can be found here.
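As a rough illustration of what such a list can look like, here is a minimal sketch using scikit-learn. The helper name build_models and the particular classifiers and hyperparameters are assumptions chosen for illustration; they are not necessarily the algorithms used in these experiments.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_models(random_state=42):
    """Return a name -> estimator mapping.

    Hypothetical selection for illustration only; swap in whichever
    classifiers you want to compare.
    """
    return {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=random_state),
        "Random Forest": RandomForestClassifier(
            n_estimators=100, random_state=random_state
        ),
        "K-Nearest Neighbors": KNeighborsClassifier(),
        "SVM": SVC(random_state=random_state),
    }


models = build_models()
```

Keeping the models in a dictionary keyed by display name makes it easy to loop over every dataset/model combination and to label the results later.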
We next read in both the original and the synthetic data, then use sampling to ensure both sets are the same size and have the same ratio of positive and negative samples (you can see those details in the full code). For each dataset/model combination, we'll run a 5-fold stratified cross-validation 5 times. As shown below, we set up a pipeline to handle missing fields, encode categorical fields, standardize all fields and then run the model.
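The steps in this paragraph can be sketched roughly as follows with pandas and scikit-learn. This is a sketch under stated assumptions, not the code from the experiments: the function names (match_size_and_ratio, build_pipeline, evaluate), the imputation strategies, and the choice of accuracy as the metric are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def match_size_and_ratio(original, synthetic, target, seed=0):
    """Downsample both frames so each holds the same number of rows per class."""
    orig_counts = original[target].value_counts()
    synth_counts = synthetic[target].value_counts()
    parts_o, parts_s = [], []
    for label in orig_counts.index.intersection(synth_counts.index):
        # Keep as many rows of this class as the smaller set can supply.
        n = int(min(orig_counts[label], synth_counts[label]))
        parts_o.append(original[original[target] == label].sample(n=n, random_state=seed))
        parts_s.append(synthetic[synthetic[target] == label].sample(n=n, random_state=seed))
    return pd.concat(parts_o, ignore_index=True), pd.concat(parts_s, ignore_index=True)


def build_pipeline(model):
    """Impute missing values, encode categoricals, standardize, then fit the model."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # assumed strategy
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # assumed strategy
        ("encode", OneHotEncoder(handle_unknown="ignore")),
        # with_mean=False because the one-hot output is sparse
        ("scale", StandardScaler(with_mean=False)),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, make_column_selector(dtype_include=np.number)),
        ("cat", categorical, make_column_selector(dtype_exclude=np.number)),
    ])
    return Pipeline([("preprocess", preprocess), ("model", model)])


def evaluate(df, target, model, seed=0):
    """Run 5-fold stratified cross-validation 5 times; return all 25 fold scores."""
    X, y = df.drop(columns=[target]), df[target]
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=seed)
    return cross_val_score(build_pipeline(model), X, y, cv=cv, scoring="accuracy")
```

Calling evaluate once on the balanced original frame and once on the balanced synthetic frame, with the same model, gives two sets of scores whose means can be compared directly. Putting the preprocessing inside the pipeline means it is refit on each training fold, so no information from the held-out fold leaks into imputation or scaling.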
We then graph each dataset showing the relative performance of each machine learning algorithm on both the original data and the synthetic data. As you can see in the graphs below, the synthetic data runs do quite well!
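For readers who want to produce a similar per-dataset chart, here is one possible sketch using matplotlib. The function name plot_comparison and its inputs (two dictionaries mapping model name to mean accuracy) are assumptions for illustration; the original charts may have been built differently.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_comparison(dataset_name, original_scores, synthetic_scores):
    """Draw grouped bars of per-model accuracy for original vs. synthetic data.

    Both score arguments are dicts of {model name: mean accuracy} sharing
    the same keys (a hypothetical input format).
    """
    names = list(original_scores)
    x = np.arange(len(names))
    width = 0.38
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(x - width / 2, [original_scores[n] for n in names], width, label="Original")
    ax.bar(x + width / 2, [synthetic_scores[n] for n in names], width, label="Synthetic")
    ax.set_xticks(x)
    ax.set_xticklabels(names, rotation=30, ha="right")
    ax.set_ylabel("Mean accuracy")
    ax.set_ylim(0, 1)
    ax.set_title(dataset_name)
    ax.legend()
    fig.tight_layout()
    return fig
```

Placing the two bars for each algorithm side by side makes any gap between original and synthetic performance visible at a glance.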
And finally, we graph a quick synopsis of all the datasets:
These datasets were chosen with no attempt to highlight only the successes. At Gretel.ai, we’re really proud of how well our synthetic data generation preserves the statistical integrity of the original data. Tougher datasets may well exist, but there are also many ways to tune a synthetic model that went unused in this set of experiments.
Synthetic data plays an important role in the future of Artificial Intelligence. Beyond the hurdle of getting swift access to sensitive data, companies often lack enough relevant data to train a model effectively. To remedy this, synthetic data can be used to augment the original training data. This is particularly useful (as described in our earlier blog) when instances of the positive class are rare, such as in fraud or cybersecurity. Synthetic data can also be used to broaden the variety of examples used in pre-production testing scenarios.
Whatever your use case, Gretel provides privacy engineering as a service, delivered to you as APIs. You can synthesize and transform data in minutes, enabling safe, trusted AI systems without sacrificing utility.