Using PyCaret to Evaluate the Accuracy of Synthetic Data
Video description
How you can use a Python library called PyCaret to evaluate the quality of your synthetic data for machine learning use cases.
Read the blog post
Transcription
Speaker 1 (00:01):
Today we're going to walk through how you can use a Python library called PyCaret to evaluate the quality of your synthetic data for machine learning use cases. PyCaret, as you can see here, is a very popular, nice open-source framework. It advertises itself as a low-code machine learning framework. One of the features it has that's really nice is that it makes it easy to run a battery of tests, so a ton of different classification algorithms, on a particular dataset and lets you see the results across all of those algorithms. So in this case, we'll be using PyCaret to evaluate the quality of synthetic data that we generate versus the real-world data that it was trained on.
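For readers following along, the core PyCaret pattern being described, setting up a classification experiment and running a battery of classifiers with `compare_models`, looks roughly like the sketch below. This is a minimal illustration, not the notebook's actual code; the file name and target column are placeholders.

```python
# Minimal sketch of PyCaret's "battery of classifiers" workflow.
# "grocery_orders.csv" and the "ordered_pizza" target column are placeholders.
import pandas as pd
from pycaret.classification import setup, compare_models

df = pd.read_csv("grocery_orders.csv")

# Initialize the experiment (older PyCaret 2.x may additionally need
# silent=True to skip the interactive dtype confirmation prompt).
setup(data=df, target="ordered_pizza", session_id=42)

# Train and cross-validate many classifiers, ranked by accuracy.
best_model = compare_models(sort="Accuracy")
print(best_model)
```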
(00:39)
You can follow along here by going to our blog and clicking on the downstream ML classification use case here at Gretel. What we'll be doing, and you can follow along in a notebook if you'd like just by clicking on the Colab link, is generating a synthetic version of a real-world multi-class classification dataset and measuring the quality of the synthetic data: a model trained on synthetic data to predict whether a user is going to buy pizza, versus a model trained on the real-world dataset it was based on. Running through this notebook takes a while, about 20 minutes, since it runs a battery of tests and we're generating synthetic data, so we won't do that now. I went ahead and ran it in a notebook here, and we'll just walk through the results, walk through the code, and explain what's happening.
(01:26)
To get started, in our Colab notebook environment we're going to install some base libraries. So we're installing PyCaret and we're installing the Gretel client. You can see these get installed as we go through here. First we're going to use the Gretel client, log into the API, and use it to create a synthetic version of this machine learning dataset we have here. So we're looking at a grocery orders dataset. Our downstream use case is: given all this knowledge about a user, predict whether the user is going to order pizza or not. We have the real-world dataset right here, no need to generate that. We are going to first train a synthetic model. We're going to use our ACTGAN model to do this, to create another dataset that has the exact same size, dimensions, and insights as the original data, but doesn't actually repeat any of that data.
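The setup cells being described follow the usual Gretel client pattern; a hedged sketch is below. The project name, config template, and file path are assumptions rather than the notebook's exact values, so refer to the linked Colab for the real cells.

```python
# Rough sketch of installing the libraries and submitting a synthetics model.
# Project name, config template, and data path are illustrative assumptions.
# !pip install -U pycaret gretel-client

from gretel_client import configure_session
from gretel_client.projects import create_or_get_unique_project
from gretel_client.projects.models import read_model_config
from gretel_client.helpers import poll

# Log into the Gretel API (prompts for an API key when run in Colab).
configure_session(api_key="prompt", validate=True)

project = create_or_get_unique_project(name="pycaret-downstream-ml")  # hypothetical name
config = read_model_config("synthetics/default")                      # assumed config template

# Train a synthetic model on the real grocery orders data, then wait for it.
model = project.create_model_obj(model_config=config, data_source="grocery_orders.csv")
model.submit_cloud()
poll(model)
```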
(02:19)
There are lots of use cases for synthetic data, whether it's privacy, whether it's augmenting limited datasets, things like that. The first question we're going to ask ourselves, though, is how well does it perform? Here we see the Gretel report, which is meant to be very general. It talks about the quality of the data, objectively looking at correlations, the data distributions, and things like that. Very useful, but it doesn't answer the question of how well this data is going to perform on the task I'm intending to use it for. So here we're going to talk about how you use PyCaret to do that evaluation, compare the data against a ton of different classifiers, take the average score that we see across them, and see how well each model
Speaker 2 (02:58):
Performs. So we start here by taking our synthetic dataset, and what we're going to do is create some train/test splits. What does that mean? We're taking 100% of the real-world grocery orders dataset and we're taking the synthetic dataset. How do we compare these? What we want to do is create a test set and a training set. The training set will be used to train these different classifiers. Here we see a variety of different ones, from K-nearest neighbors to random forest, essentially the classic data science classifiers, and we find out which model works best. Then we'll compare the optimal model trained on the synthetic dataset against the optimal model trained on the real-world training data. I've got a couple of nice graphs I can show you here for what's happening. We are splitting the original data into several different pieces.
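As a concrete illustration of the splits being described, here is a scikit-learn sketch. The file names, target column, and split size are placeholders rather than the notebook's exact values.

```python
# Illustrative train/holdout split; names and sizes are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

real = pd.read_csv("grocery_orders.csv")
synthetic = pd.read_csv("synthetic_grocery_orders.csv")

# Hold out 20% of the real data; no classifier ever trains on this slice.
real_train, real_holdout = train_test_split(
    real, test_size=0.20, random_state=42, stratify=real["ordered_pizza"])

# The synthetic data plays the role of an alternative training set.
synthetic_train = synthetic
```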
(03:56)
So it's going to go here. We see about 25% of it is going to go to train the classifier. We're going to hold out some of that data, so the holdout here is 20% of the data, which we move completely to the side; the model never sees it. The intent is that when the models are fully trained against the training data and against the validation data, which will leak a little bit into the model, they're run against data they've never seen before, which is this test set. That's the goal. That's what the train/test split is. And then PyCaret does some really fancy stuff for us as well. Essentially, it removes the need for us to have a separate validation set, which would be another 5% loss of our data. The way it makes up for that is it uses something called folds. So we'll jump over to the scikit-learn documentation here and learn what a fold is.
(04:48)
So here, once again, we see a roughly 20% test split: we take our training data, the entire dataset from grocery orders, and split out 20% as something the model will never see, which we use for final evaluation after the model's been trained. But instead of needing to take another somewhere between 5 and 20% to use for validation, we end up randomly selecting folds of the data, five in this case, so 20% of the data is used as validation at each different split. So we run multiple tests against multiple classifiers, each time splitting the data differently. What this helps us make sure of is that we don't over-optimize for one randomly selected validation set. Rather, it gives us more comfort that our model is going to perform well against the entire dataset.
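In PyCaret this fold behavior is controlled by the `fold` argument to `setup()` and `compare_models()`; under the hood it is standard k-fold cross-validation. Here is a scikit-learn sketch of the idea, assuming numeric features and reusing the placeholder names from the split sketch above.

```python
# k-fold cross-validation: score a classifier on several rotating validation
# folds instead of a single fixed validation split (placeholder names reused).
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X = real_train.drop(columns=["ordered_pizza"])   # assumes numeric features
y = real_train["ordered_pizza"]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())   # average accuracy across the 5 folds
```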
(05:37)
So jumping back here, the first thing we did is run a test on our synthetic training data. So this is the synthetic dataset: how well did the classifiers do against the synthetic test set that we built using entirely synthetic data? Here we got an accuracy of about 93% on the synthetic test set, averaged across some of these top classifiers. The best result we got, as you can see here, is the K-nearest neighbors classifier, so that's the one we're going to use going forward. Then we'll jump down a little bit and run the same test, you'll see here, on the real-world data. So on real-world data against the real-world test set, we again see the K-nearest neighbors classifier being split out, and we see it getting about 94%, so about a percentage point higher in accuracy using the original training data versus the original test split. Once again, K-nearest neighbors worked best for this dataset, above these other algorithms. There's a level of randomness; if you run this a few times, you might get different results, but usually these guys, random forest or K-nearest neighbors, are going to do very well on this dataset.
(06:51)
The next thing we're going to do is take our synthetic data model, now that we've picked our optimal model, you can see it here, and run it against the original test data. This is the 20% we split out and never looked at; no model has seen it so far. We ask, how well does the model trained on our synthetic data, using the K-nearest neighbors algorithm, perform on that data? Here we see the results, about 95%, so it actually did very well on the real-world data, a little bit better than it did on the synthetic test dataset. Exciting results here. Let's go ahead and take a look. We're going to go down and check how the best model trained on the real-world data compares against the one trained on the synthetic data. So, 95.26% for the synthetic data; we're going to go down.
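In PyCaret terms, that final evaluation step looks roughly like the sketch below, reusing the placeholder dataframes from the earlier split sketch; it illustrates the pattern rather than the notebook's exact code.

```python
# Evaluate the classifier trained on synthetic data against the untouched
# real-world holdout (placeholder dataframes from the earlier sketch).
from pycaret.classification import setup, compare_models, predict_model

setup(data=synthetic_train, target="ordered_pizza", session_id=42)
best_synthetic_model = compare_models(sort="Accuracy")

# predict_model reports accuracy and other metrics on the supplied data.
holdout_predictions = predict_model(best_synthetic_model, data=real_holdout)
```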
(07:37)
Once again, we picked the best classifier, and in this example we run the best algorithm against the original test data using the model that was trained on the original real-world data: 95.16%. So the model trained on synthetic data actually outperformed the model trained on real-world data on the real-world test set. There are lots of reasons why that could happen. Sometimes oversampling techniques work very well; you can use them to create additional examples. In this case, we created the same number of data points for our synthetic dataset as we had in the original real-world dataset, so it appears the model recreated the data in a distribution that helped this classifier learn a little bit better, resulting in the synthetic data actually outperforming the real-world data on the real-world test set by about a tenth of a percentage point. We do a bunch of other tests here as well, but that kind of summarizes how you might run your own test. Essentially, take a real-world dataset and split it so you have 80% training data and 20% test data. From the 80% of training data, you split it again and figure out which classifier performs best. That data then trains a classifier that we run against the original real-world test set, both for the
Speaker 3 (08:54):
Synthetic version and the real-world version of the training data. And we can compare the results; often they are extremely close. We're using a deep learning, GAN-based model here, so it's very good at learning and recreating the patterns in a dataset. In this case, the synthetic data actually came out a little ahead of the real-world data.
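Pulled together, the comparison the video walks through amounts to something like this end-to-end sketch, again using the placeholder names from the earlier snippets; PyCaret's `pull()` grabs the metrics table produced by the last command.

```python
# End-to-end comparison sketch: train on real vs. synthetic data,
# evaluate both best models on the same real-world holdout.
from pycaret.classification import setup, compare_models, predict_model, pull

holdout_metrics = {}
for name, train_df in [("real", real_train), ("synthetic", synthetic_train)]:
    setup(data=train_df, target="ordered_pizza", session_id=42)
    best = compare_models(sort="Accuracy")
    predict_model(best, data=real_holdout)
    holdout_metrics[name] = pull()   # metrics table for the holdout evaluation

print(holdout_metrics["real"])
print(holdout_metrics["synthetic"])
```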