Machine Learning Accuracy Using Synthetic Data

Can synthetic data really be used in machine learning? We explore the utility of synthetic data created from popular datasets and tested on popular ML algorithms.

One of the questions we hear often from developers working with synthetic data is “how well does synthetic data work for ML use cases versus the original dataset?” In other words, can synthetic data really be used to produce a machine learning model on par with one trained on the original data? Let’s dive in and find out.

Our Experiments

To narrow down the test datasets to use, we queried the data section of Kaggle with the “Classification” filter turned on, searching for datasets with significant activity. Our criteria were that each dataset had to have at least 4000 records and be text-based (as opposed to image or audio). Below is a table listing the datasets we used, as well as each dataset's row and column count. When training the synthetic model, all rows and columns were used with no modifications. All training parameters used the default settings for gretel-synthetics, and we generated 5000 synthetic records for each dataset.

Kaggle Dataset                  Rows    Columns   Classification Task
Telco Customer Churn            7043    20        Predict which customers will leave
IBM HR Attrition Dataset        4408    35        Predict which employees will leave
UCI Adult Census Income         32561   15        Predict who earns more than $50,000
UCI Credit Card Default         30000   25        Predict default on a credit card payment
Adventure Works Bike Buying     16386   24        Predict customers who will buy a bike
Stroke Prediction Dataset       5109    12        Predict who will have a stroke
Data Scientist Job Candidates   19158   14        Predict which data scientists are willing to change jobs
Bank Marketing Dataset          11162   17        Predict Term Deposit Subscriptions
Public datasets


The complete code used in these experiments can be found here, and all original datasets and the generated synthetic records can be found here.

We start by building the list of machine learning algorithms we’ll be testing, as shown below.
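A minimal sketch of such a model list, assuming a typical scikit-learn lineup. The exact classifiers and hyperparameters in our notebook may differ from the assumptions shown here:

```python
# Hypothetical roster of classifiers to benchmark; names and settings
# are illustrative, not the exact list from the experiments.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

models = [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Decision Tree", DecisionTreeClassifier()),
    ("Random Forest", RandomForestClassifier()),
    ("K-Nearest Neighbors", KNeighborsClassifier()),
    ("SVM", SVC()),
    ("Naive Bayes", GaussianNB()),
]
```

Keeping each entry as a (name, estimator) pair makes it easy to loop over every dataset/model combination and label the results.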

We next read in both the original and the synthetic data, then use sampling to ensure both sets are the same size and have the same ratio of positive and negative samples (you can see those details in the full code). For each dataset/model combination, we'll run a 5-fold stratified cross-validation 5 times. As shown below, we set up a pipeline to handle missing fields, encode categorical fields, standardize all fields and then run the model.
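A hedged sketch of that preprocessing and cross-validation setup, assuming scikit-learn's Pipeline and ColumnTransformer. The helper name `evaluate` and the specific imputation/encoding choices are illustrative, not the notebook's exact code:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def evaluate(df: pd.DataFrame, label: str, model) -> float:
    """Return mean accuracy over 5 runs of 5-fold stratified CV."""
    X, y = df.drop(columns=[label]), df[label]
    num_cols = X.select_dtypes(include="number").columns
    cat_cols = X.columns.difference(num_cols)

    # Impute missing values, one-hot encode categoricals, standardize numerics.
    preprocess = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), num_cols),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("encode", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
    ])
    pipe = Pipeline([("prep", preprocess), ("model", model)])

    # 5-fold stratified cross-validation, repeated 5 times.
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
    return cross_val_score(pipe, X, y, cv=cv, scoring="accuracy").mean()
```

Running `evaluate` once on the original data and once on the synthetic data, for each model, yields the paired scores graphed in the next section.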

Results

For each dataset, we then graph the relative performance of each machine learning algorithm on both the original data and the synthetic data. As you can see in the graphs below, the synthetic data runs do quite well!
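A sketch of how such a comparison chart could be drawn with matplotlib. The accuracy values below are placeholders, not results from these experiments:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

algorithms = ["LogReg", "Tree", "Forest", "KNN", "SVM", "NB"]
original = [0.80, 0.76, 0.81, 0.74, 0.79, 0.70]   # placeholder scores
synthetic = [0.79, 0.75, 0.80, 0.72, 0.78, 0.69]  # placeholder scores

# Grouped bar chart: one pair of bars per algorithm.
x = np.arange(len(algorithms))
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(x - 0.2, original, width=0.4, label="Original")
ax.bar(x + 0.2, synthetic, width=0.4, label="Synthetic")
ax.set_xticks(x)
ax.set_xticklabels(algorithms)
ax.set_ylabel("Accuracy")
ax.set_title("Original vs. Synthetic Data")
ax.legend()
fig.savefig("comparison.png")
```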

[Per-dataset accuracy graphs: Data Scientist Job Candidates, Stroke Prediction, Bank Marketing Dataset, UCI Adult Census Income, Adventure Works Bike Buying, Telco Customer Churn, IBM HR Attrition, UCI Credit Card Default]

And finally, we graph a quick synopsis of all the datasets:

ML Accuracy Using Original vs Synthetic Data


Conclusion

These datasets were chosen with no attempt to highlight only the successes. At Gretel.ai, we’re really proud of how well our synthetic data generation preserves the statistical integrity of the original data. While tougher datasets certainly exist, there are also many ways to tune a synthetic model that went unused in this set of experiments.

Synthetic data plays an important role in the future of Artificial Intelligence. Beyond the hurdle of gaining swift access to sensitive data, companies often lack enough relevant data to effectively train a model. To remedy this, synthetic data can be used to augment the original training data. This is particularly valuable (as described in our earlier blog) when instances of the positive class are rare, such as in fraud or cybersecurity. Synthetic data can also be used to broaden the variety of examples used in pre-production testing scenarios.

Whatever your use case, Gretel provides privacy engineering as a service, delivered to you as APIs. You can synthesize and transform data in minutes, enabling safe, trusted AI systems without sacrificing utility.