Use AI to Create Synthetic Data from a DataFrame or CSV

Video description

Apr 20, 2021

In this tutorial, Alex walks you through how to use AI to create synthetic data from a DataFrame or CSV using Gretel.


Today, we're going to walk through an end-to-end example of generating synthetic data using Gretel.ai services. First, we'll go ahead and sign in. You can go to console.gretel.cloud, and you can sign in with a developer ID or GitHub address there. From here once we're signed in, we'll go ahead and select new project. Select blank project. You can name your project if you want to.

Once you've created your new project, we'll go to transform. From here, we have a set of Python Notebooks that help you address popular use cases that we see around using Gretel.ai. For example, today we're going to launch a notebook called Create Synthetic Data from a CSV or DataFrame. This opens us up into the Google Colab environment, which is a hosted notebook environment from Google. It's notable because it gives you free GPU access, which is really nice for training machine learning models. You also have the option if you wish of going ahead and downloading the notebook, or just downloading the raw Python to run in your own environment.

Once again, it's heavily recommended that you run along with the GPU. We'll go ahead and click run all, and we'll walk through what's happening here as we train a model and generate synthetic data. What's happening first is the notebook environment here is installing our open source packages. You can see the Gretel.ai client, the Gretel.ai synthetic data library and pandas are all being installed.

Next, it's going to ask us for an API key to authenticate for some of our premium services. We go back to the Gretel.ai console, choose integration, show API key, grab the API key, and we'll go back to the notebook here. Once we input the API key, it'll go ahead and download those premium packages into our environment. We'll be accessing the SDKs for the rest of this exercise. The next step here is creating our CSV dataset. The ideal inputs to pass to our synthetic data library to create an artificial dataset are CSV, JSON, or a pandas dataframe.

Here, we're going to go ahead and load a dataframe. We're using a popular healthcare analytics dataset from Kaggle. You can see here we're going to pull back 10,000 records. We read that using pandas into a CSV, or sorry, into a pandas dataframe, that we'll pass to our synthetic data training library. Here, you can see it downloaded the CSV, loaded it into a dataframe, and we can take a look at it here. We have 10,000 rows, 18 columns. Really varied kind of data. So, this is a good example to use when training our synthetic model.
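As a minimal sketch of that loading step, the snippet below reads a capped number of rows from a CSV into a pandas dataframe. The column names and the in-memory CSV text are hypothetical stand-ins for the Kaggle file, which is not reproduced here, and `load_training_frame` is an illustrative helper rather than part of any Gretel package.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded CSV (the real file is a Kaggle
# healthcare dataset with 18 columns; these four columns are made up).
csv_text = "case_id,department,age,stay\n" + "\n".join(
    f"{i},radiology,41-50,{i % 7}" for i in range(25)
)


def load_training_frame(source, max_rows):
    """Read at most max_rows records from a CSV path or file-like object."""
    return pd.read_csv(source, nrows=max_rows)


# The tutorial pulls back 10,000 records; 10 keeps this toy example small.
df = load_training_frame(io.StringIO(csv_text), max_rows=10)
print(df.shape)   # rows, columns
print(df.dtypes)  # mix of integer and string columns
```

Passing a file path or URL instead of the `StringIO` object works the same way, since `pd.read_csv` accepts either.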

You have some integer data, some categorical data, floating points, dates, age ranges, things like that. All of these get learned. And really, our goal going through this exercise is to create another dataset that has a very similar distribution to this original dataset, but has none of the duplicated records. You get really kind of nice privacy benefits using this. Going through here, we see a configuration template. Tons of options here. This is really minimized. This is the bare minimum that you need to auto-generate a synthetic model.

If you want to go ahead and customize this yourself for your particular use case, you can simply go to our documentation. If you go to our open source library for synthetics, there's a link that you can grab right there. Let's go ahead and take a look at documentation for configuration. We are using a TensorFlow backend here, because it's the default model and it supports differential privacy-based training. We'll go ahead and look at the TensorFlow config. This shows us the options we have of setting here where we can configure anywhere from the number of epochs to the learning rate to the complexity of the RNN that's being used on the backend, and whether we'd like to use TensorFlow differential privacy or not.

In this mode, we're going to run without differential privacy. We'll just use the default settings to generate a synthetic model. Here, we defined a vocabulary size of 20,000, passed to our tokenizer, and we can see it going ahead and starting to build our model. We define our model here. We pass in a few parameters here. We're going to tell it to build a validator. What is a validator? A validator is code that runs as part of our premium package and takes a look at the different columns inside of your data. It looks at the ranges that are expected of the columns, whether they're categorical or floating point, and makes sure that the neural network outputs values within the ranges that we saw in the training data. Really, it gives you confidence that the data you're creating matches the shape and the distribution of your original data.
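To make the validator idea concrete, here is a rough sketch in plain pandas. It is not Gretel's validator, whose internals are not shown in this video; it only illustrates the general technique of inferring a numeric range or a set of allowed categories per column from the training data and rejecting generated records that fall outside them. The function names, the sample columns, and the `max_categories` threshold are all assumptions made for illustration.

```python
import pandas as pd


def infer_rules(train: pd.DataFrame, max_categories: int = 50) -> dict:
    """Learn one simple rule per column: a numeric range or a category set."""
    rules = {}
    for col in train.columns:
        series = train[col]
        if pd.api.types.is_numeric_dtype(series):
            rules[col] = ("range", float(series.min()), float(series.max()))
        elif series.nunique() <= max_categories:
            rules[col] = ("categories", set(series.astype(str)))
        else:
            rules[col] = ("any",)  # free text: no constraint in this sketch
    return rules


def is_valid(record: dict, rules: dict) -> bool:
    """Return True if the record has the expected columns and obeys every rule."""
    if set(record) != set(rules):
        return False
    for col, rule in rules.items():
        value = record[col]
        if rule[0] == "range":
            try:
                number = float(value)
            except (TypeError, ValueError):
                return False
            if not (rule[1] <= number <= rule[2]):
                return False
        elif rule[0] == "categories" and str(value) not in rule[1]:
            return False
    return True


# Hypothetical training data with a categorical, an integer and a float column.
train = pd.DataFrame(
    {
        "age": ["41-50", "51-60", "41-50"],
        "visitors": [2, 4, 3],
        "deposit": [4500.0, 5200.5, 3900.0],
    }
)
rules = infer_rules(train)

good = {"age": "51-60", "visitors": 3, "deposit": 4000.0}
bad_range = {"age": "51-60", "visitors": 40, "deposit": 4000.0}
bad_category = {"age": "banana", "visitors": 3, "deposit": 4000.0}
print(is_valid(good, rules), is_valid(bad_range, rules), is_valid(bad_category, rules))
```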

We have early stopping configured here to prevent overfitting from happening, but you see it here training the model. See the loss dropping pretty significantly. In my experience, typically anywhere below 0.7 or 0.8, you start to get pretty good quality synthetic data. You can see it very quickly converged on a good solution here. We don't want this to overfit. We also want it to run really quickly. So, what I'm going to do here is go ahead and stop this training after 20 epochs, and we'll go right over to generating data. Let's let it finish up 20 epochs. That ought to be a good enough solution to get reasonable quality data. Let's stop there. Okay.
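Early stopping itself is a generic idea, and a small sketch may help. The function below is not the training loop used by the notebook. It simply walks through a list of per-epoch loss values (made up here) and stops once the loss has failed to improve for `patience` epochs in a row, or once `max_epochs` is reached. The parameter names are illustrative assumptions.

```python
def train_with_early_stopping(losses, patience=2, min_delta=0.0, max_epochs=20):
    """Simulate training over recorded losses and return (epochs_run, best_loss).

    Stops when the loss has not improved by more than min_delta for
    `patience` consecutive epochs, or after max_epochs.
    """
    best = float("inf")
    stale = 0
    epochs_run = 0
    for epoch, loss in enumerate(losses, start=1):
        if epoch > max_epochs:
            break
        epochs_run = epoch
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return epochs_run, best


# Hypothetical loss curve: improves for four epochs, then stalls.
losses = [2.1, 1.4, 0.9, 0.75, 0.76, 0.77, 0.5]
epochs_run, best_loss = train_with_early_stopping(losses, patience=2)
print(epochs_run, best_loss)
```

With this loss curve the loop stops at epoch 6, so it never sees the later value of 0.5. That is the usual trade-off of a short patience window.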

I went ahead and stopped that. Here, we're going to tell it to generate ... let's just have it generate a thousand rows. This happens really quickly, and we'll continue running. Here, you see two different fields. It's essentially loading the model right now, and we've told it to generate a thousand records for us. You see two different counters here happening at the bottom. Valid records are records that are being generated by a neural network that are passing validation. They're passing those inferred parameters that we learned looking at the training set. You can see it here very quickly building out those thousand records.
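The valid and invalid counters can be pictured with a small generate-and-filter loop. This is only a sketch of the general pattern, not the notebook's generation code: `sample_fn` stands in for the trained model and `validate_fn` for the validator, and both are hypothetical toy functions here.

```python
import random


def generate_valid_records(sample_fn, validate_fn, num_records, max_invalid=1000):
    """Draw candidates until num_records pass validation.

    Returns (valid_records, invalid_count). Raises RuntimeError if too many
    candidates fail, which would suggest the model has not trained well.
    """
    valid = []
    invalid = 0
    while len(valid) < num_records:
        candidate = sample_fn()
        if validate_fn(candidate):
            valid.append(candidate)
        else:
            invalid += 1
            if invalid >= max_invalid:
                raise RuntimeError("too many invalid records generated")
    return valid, invalid


# Toy stand-ins: a random "model" and a range check on one column.
rng = random.Random(0)


def sample_fn():
    return {"visitors": rng.randint(0, 12)}


def validate_fn(record):
    return 1 <= record["visitors"] <= 10


records, invalid_count = generate_valid_records(sample_fn, validate_fn, num_records=100)
print(len(records), "valid,", invalid_count, "invalid")
```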

The next thing we'll do is preview that and take a look. Does this data look sane? Does it look like it matches our input? Here, you can see ... at least on initial inspection, it looks very good. Here you see, once again, we created a thousand rows. We asked it to create 18 columns, and we can see the columns, even down to the age bracket, seem to very closely match what we had in our original data. To get that additional level of confidence ... so, you want to understand the distribution of the data, and whether our model actually learned those really cool correlations and things like that.

The next thing we do is generate a detailed report. As you can see here, this detailed report shows you distributions across every column between both the synthetic and the training datasets. And then also some really cool high level statistics. Really starting off with how many of the original training lines were duplicated in the synthetic dataset. Here, we don't see any. And then we have a bunch of distance metrics that we can walk through here as well, looking at the distance metrics and the number of unique values, and the distributions between the original dataset and the synthetic dataset we created.
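Two of those report statistics are easy to approximate by hand. The sketch below is not the report's implementation; it is a plain pandas approximation that counts synthetic rows that exactly match a training row and compares the number of unique values per column. The tiny dataframes and function names are hypothetical.

```python
import pandas as pd


def count_duplicated_training_rows(train: pd.DataFrame, synthetic: pd.DataFrame) -> int:
    """Number of synthetic rows that exactly match some training row."""
    train_unique = train.drop_duplicates()
    # Inner merge on every column keeps only the rows present in both frames.
    merged = synthetic.merge(train_unique, how="inner", on=list(train.columns))
    return len(merged)


def unique_value_counts(train: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Side-by-side count of distinct values in each column."""
    return pd.DataFrame({"train": train.nunique(), "synthetic": synthetic.nunique()})


# Hypothetical data: one synthetic row, ("a", 2), is copied from training.
train = pd.DataFrame({"dept": ["a", "b", "c"], "visitors": [2, 4, 3]})
synthetic = pd.DataFrame({"dept": ["a", "b", "b"], "visitors": [2, 3, 3]})

duplicates = count_duplicated_training_rows(train, synthetic)
counts = unique_value_counts(train, synthetic)
print(duplicates)
print(counts)
```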

One of my favorite ones to look at right here is the correlation graph. This is a column by column comparison of the correlations that existed in the original dataset, and how well they were replicated in the synthetic dataset. Here, this looks a little bit like a Minecraft sword. Essentially what we're trying to do is make sure that the correlations we see existing in this first dataset, for example, the number of stay ... it appears to be highly correlated with the number of visitors that you have, which is interesting. You want to make sure that gets replicated in the synthetic dataset.

The view below that is very similar here. It's just a subtraction of these two. Anywhere we see a color change here would indicate that there was a correlation that existed in one of the datasets that did not exist in the other. That can give us an indication if there's anything we need to dive into. Once again, we don't really see anything here yet. Once again, we can go through and look at the individual distributions, and see how well they match up. We only generated a very limited amount of data. We generated a thousand rows versus 10,000 rows of training data. So, you can't expect it to have perfect stuff here, but the correlations and the distributions across these different categorical and numeric fields actually appear to be very good, giving us an indication that we've got a healthy synthetic model we can use to generate data for our use cases.
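The correlation comparison and its subtraction view can be sketched with pandas as well. This is an approximation under stated assumptions: it uses Pearson correlation on numeric columns only, whereas the report shown in the video also covers categorical fields, and the column names and values below are made up to force an obvious mismatch.

```python
import pandas as pd


def correlation_difference(train: pd.DataFrame, synthetic: pd.DataFrame):
    """Return (train_corr, synthetic_corr, absolute_difference) for numeric columns."""
    numeric_cols = train.select_dtypes("number").columns
    train_corr = train[numeric_cols].corr()
    synth_corr = synthetic[numeric_cols].corr()
    # Large values mark correlations present in one dataset but not the other.
    return train_corr, synth_corr, (train_corr - synth_corr).abs()


# Hypothetical data: stay and visitors rise together in training,
# but move in opposite directions in the (deliberately bad) synthetic set.
train = pd.DataFrame({"stay": [1, 2, 3, 4, 5], "visitors": [2, 4, 6, 8, 10]})
synthetic = pd.DataFrame({"stay": [1, 2, 3, 4, 5], "visitors": [10, 8, 6, 4, 2]})

train_corr, synth_corr, diff = correlation_difference(train, synthetic)
print(diff)
```

A healthy synthetic model would give a difference matrix close to zero everywhere, which corresponds to the absence of color changes described above.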