Machine Learning Accuracy Using Synthetic Data

Can synthetic data really be used in machine learning? We explore the utility of synthetic data created from popular datasets and tested on popular ML algorithms.

One of the questions we hear often from developers working with synthetic data is “how well does synthetic data work for ML use cases versus the original dataset?” In other words, can synthetic data really be used to produce a machine learning model on par with one trained on the original data? Let’s dive in and find out.

Our Experiments

To narrow down the test datasets to use, we queried the data section of Kaggle with the “Classification” filter turned on, searching for datasets with significant activity. Our criteria were that each dataset had to have at least 4000 records and be text-based (as opposed to image or audio). Below is a table listing the datasets we used, as well as each dataset's row and column count. When training the synthetic model, all rows and columns were used with no modifications. All training parameters used the default settings for gretel-synthetics, and we generated 5000 synthetic records for each dataset.

Kaggle Dataset                  Rows    Columns   Classification Task
Telco Customer Churn            7043    20        Predict which customers will leave
IBM HR Attrition Dataset        4408    35        Predict which employees will leave
UCI Adult Census Income         32561   15        Predict who earns more than $50,000
UCI Credit Card Default         30000   25        Predict default on a credit card payment
Adventure Works Bike Buying     16386   24        Predict customers who will buy a bike
Stroke Prediction Dataset       5109    12        Predict who will have a stroke
Data Scientist Job Candidates   19158   14        Predict which data scientists are willing to change jobs
Bank Marketing Dataset          11162   17        Predict Term Deposit Subscriptions
Public datasets


The complete code used in these experiments can be found here, and all original datasets and the generated synthetic records can be found here.

We start by building the list of machine learning algorithms we’ll be testing, as shown below.
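A minimal sketch of such a model list, assuming a typical scikit-learn lineup. The exact classifiers and hyperparameters in our notebook may differ from the assumptions shown here:

```python
# Hypothetical roster of classifiers to benchmark; names and settings
# are illustrative, not the exact list from the experiments.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

models = [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Decision Tree", DecisionTreeClassifier()),
    ("Random Forest", RandomForestClassifier()),
    ("K-Nearest Neighbors", KNeighborsClassifier()),
    ("SVM", SVC()),
    ("Naive Bayes", GaussianNB()),
]
```

Keeping each entry as a (name, estimator) pair makes it easy to loop over every dataset/model combination and label the results.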

We next read in both the original and the synthetic data, then use sampling to ensure both sets are the same size and have the same ratio of positive and negative samples (you can see those details in the full code). For each dataset/model combination, we'll run a 5-fold stratified cross-validation 5 times. As shown below, we set up a pipeline to handle missing fields, encode categorical fields, standardize all fields and then run the model.
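A hedged sketch of that preprocessing and cross-validation setup, assuming scikit-learn's Pipeline and ColumnTransformer. The helper name `evaluate` and the specific imputation/encoding choices are illustrative, not the notebook's exact code:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def evaluate(df: pd.DataFrame, label: str, model) -> float:
    """Return mean accuracy over 5 runs of 5-fold stratified CV."""
    X, y = df.drop(columns=[label]), df[label]
    num_cols = X.select_dtypes(include="number").columns
    cat_cols = X.columns.difference(num_cols)

    # Impute missing values, one-hot encode categoricals, standardize numerics.
    preprocess = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), num_cols),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("encode", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
    ])
    pipe = Pipeline([("prep", preprocess), ("model", model)])

    # 5-fold stratified cross-validation, repeated 5 times.
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
    return cross_val_score(pipe, X, y, cv=cv, scoring="accuracy").mean()
```

Running `evaluate` once on the original data and once on the synthetic data, for each model, yields the paired scores graphed in the next section.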

Results

For each dataset, we then graph the relative performance of each machine learning algorithm on both the original data and the synthetic data. As you can see in the graphs below, the synthetic data runs do quite well!
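A sketch of how such a comparison chart could be drawn with matplotlib. The accuracy values below are placeholders, not results from these experiments:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

algorithms = ["LogReg", "Tree", "Forest", "KNN", "SVM", "NB"]
original = [0.80, 0.76, 0.81, 0.74, 0.79, 0.70]   # placeholder scores
synthetic = [0.79, 0.75, 0.80, 0.72, 0.78, 0.69]  # placeholder scores

# Grouped bar chart: one pair of bars per algorithm.
x = np.arange(len(algorithms))
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(x - 0.2, original, width=0.4, label="Original")
ax.bar(x + 0.2, synthetic, width=0.4, label="Synthetic")
ax.set_xticks(x)
ax.set_xticklabels(algorithms)
ax.set_ylabel("Accuracy")
ax.set_title("Original vs. Synthetic Data")
ax.legend()
fig.savefig("comparison.png")
```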

[Per-dataset accuracy graphs: Data Scientist Job Candidates, Stroke Prediction, Bank Marketing Dataset, UCI Adult Census Income, Adventure Works Bike Buying, Telco Customer Churn, IBM HR Attrition, UCI Credit Card Default]

And finally, we graph a quick synopsis of all the datasets:

ML Accuracy Using Original vs Synthetic Data


Conclusion

These datasets were chosen with no attempt to highlight only the successes. At Gretel.ai, we’re really proud of how well our synthetic data generation preserves the statistical integrity of the original data. While tougher datasets certainly exist, there are also many ways to tune a synthetic model that went unused in this set of experiments.

Synthetic data plays an important role in the future of Artificial Intelligence. Beyond the hurdle of gaining swift access to sensitive data, companies often lack enough relevant data to effectively train a model. To remedy this, synthetic data can be used to augment the original training data. This is particularly valuable (as described in our earlier blog) when instances of the positive class are rare, such as in fraud or cybersecurity. Synthetic data can also be used to broaden the variety of examples used in pre-production testing scenarios.

Whatever your use case, Gretel provides privacy engineering as a service, delivered to you as APIs. You can synthesize and transform data in minutes, enabling safe, trusted AI systems without sacrificing utility.