Walkthrough: Create Synthetic Data from any DataFrame or CSV
Train an AI model to create an anonymized version of your dataset using Python, Pandas, and gretel-synthetics.
Today we're going to walk through using Gretel's APIs to create synthetic data from a CSV or Pandas DataFrame. Let's jump right in. You can find the notebook that we're going to walk through on our GitHub, or go ahead and launch the notebook in Colab, Google's free notebook hosting service.
It's a really simple install here: all we have to do is install the Gretel client with a simple pip install. We also need an API key, which allows us to interact with the API services. While that's installing, we'll go ahead and log into the Gretel console at https://console.gretel.cloud and grab a copy of our API key. All right, I'm going to go ahead and kick off the rest of this project while we talk through what's happening.
The first thing that we're going to do here is use the Python SDK, which interacts with Gretel's API services. Let's start by creating a project. We're going to name it synthetic data, but you can name it whatever you would like. Now for the fun stuff. We're going to pull down one of our template configs. These configs essentially define a set of neural network parameters that are used by our synthetic model, based on the data that it's trained on. Here we can see that we went ahead and downloaded the default configuration, which is a pretty sane configuration that should work for most data sets. You can browse other configurations by clicking on this link as well. The next step is to go ahead and load the sample data set that we want to create a synthetic version of into a DataFrame, so here we can see we'll load up Pandas.
We'll define the path here. We're going to load this from Amazon S3, but you can load any local CSV file that you would like into your DataFrame. We'll go ahead and preview that DataFrame, and we're also going to save it out to a local CSV file. We can see we're looking at a financial data set. It's about 5,000 rows and 15 columns, so it should be a nice challenge for the synthetic model to see if it can learn from this little data. Let's go ahead and save this file off to disk, and then we'll use that to start training our model.
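If you want to try the load, preview, and save step on its own, here is a minimal sketch. The inline sample rows, the column names, and the training_data.csv file name are all made up for illustration; in the notebook the data comes from a CSV hosted on Amazon S3, and you would pass that path or URL to read_csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the sample data set. The walkthrough loads a
# real CSV from Amazon S3; these rows and column names are invented.
SAMPLE_CSV = """age,workclass,education,education_num,income_bracket
39,State-gov,Bachelors,13,<=50K
50,Self-emp,Bachelors,13,<=50K
38,Private,HS-grad,9,<=50K
52,Self-emp,HS-grad,9,>50K
"""

# read_csv accepts a local path, a URL, or any file-like object.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Preview the DataFrame: its dimensions and the first few rows.
print(df.shape)
print(df.head())

# Save the training data to a local CSV file, leaving out the index column.
df.to_csv("training_data.csv", index=False)
```

Writing with index=False keeps the row index out of the file, so the model would only see the actual columns.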
Here you can see we're creating the model, submitting it to the API service, and telling the service to go ahead and train it using the Gretel cloud. You can run this locally in your environment, but running in the Gretel cloud allows you to get by without having a GPU or needing to configure TensorFlow, CUDA, or the underlying dependencies. Here we can see that the model finished training. You can see the accuracy on both the test set and the validation set is quite good: we're at 90% accuracy, and loss is pretty low. Typically, loss anywhere under one will indicate you've got a pretty good model. We're going ahead and using the model to generate 5,000 records.
We are going to use those 5,000 records to profile the synthetic model and figure out how good a job it's doing at creating synthetic data. We will also go ahead and create a report, and we'll take a look at that report next. Here is a version of our synthetic DataFrame. This is the entirely artificial data set that was created by our model. You can see it mimics the input data very closely: different age distributions, working class, the same types of fields that we came to expect when we were looking at the training data. The next question we have is: how good a job did the model do at retaining the insights, characteristics, and correlations from the original data set?
So let's go ahead and dive in and take a look at that. For that we have the synthetic quality report, which we'll go through right now. Essentially, what we're doing is throwing the kitchen sink at it from a statistical perspective, comparing the output of the model, those 5,000 records we generated, against the input data, the roughly 5,000 records we put in. Here you can see that overall we got an excellent score, and the different measures that we're looking at to assess the correlations and the distributions per field are holding up quite well. From a privacy perspective, none of the training data was duplicated, so these are entirely new records that were created.
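To make the privacy check concrete, here is a minimal sketch of one way to count synthetic records that exactly match a training record. This is an illustration only, not how the Gretel report computes it; the count_exact_duplicates helper and the sample rows are hypothetical.

```python
def count_exact_duplicates(training_rows, synthetic_rows):
    """Return how many synthetic rows are identical to some training row."""
    # Tuples are hashable, so a set gives fast membership checks.
    training_set = {tuple(row) for row in training_rows}
    return sum(1 for row in synthetic_rows if tuple(row) in training_set)


# Hypothetical sample rows: (age, workclass, education).
training = [
    (39, "State-gov", "Bachelors"),
    (50, "Self-emp", "Bachelors"),
    (38, "Private", "HS-grad"),
]
synthetic = [
    (41, "Private", "Bachelors"),
    (38, "Private", "HS-grad"),  # identical to a training row
    (47, "State-gov", "HS-grad"),
]

duplicates = count_exact_duplicates(training, synthetic)
print(f"{duplicates} of {len(synthetic)} synthetic rows duplicate a training row")
```

With a DataFrame you could feed this the rows from df.itertuples(index=False). A count of zero corresponds to what the walkthrough reports: no training records repeated in the synthetic output.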
Here we get an overview of every field, and there's a nice deep dive that we can look at as we go further down the report. This is my personal favorite to look at: the training and synthetic data correlations. What we are looking at here are the correlations that exist between each pair of fields. You obviously expect a strong correlation between each field and itself, going up the diagonal, but we can see other cases here too. For example, an integer field that appears to identify the education level always matches the education field, so there's a strong correlation between those two fields. And we see weaker correlations between things like relationship and age, or relationship and marital status, where we see nearly 50% correlation in the input data. What we're looking for here is that the synthetic data model is able to recreate, as closely as possible, the correlations that exist in the original data set.
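As a rough illustration of the idea behind that correlation view, the sketch below computes Pearson correlations between numeric fields for a training set and a synthetic set, then reports the largest gap between the two. It covers numeric fields only and is not the report's actual method; the helper names, column names, and values are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map every (field_a, field_b) pair to its Pearson correlation."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


def max_correlation_gap(train_columns, synth_columns):
    """Largest absolute difference between matching correlation entries."""
    train = correlation_matrix(train_columns)
    synth = correlation_matrix(synth_columns)
    return max(abs(train[pair] - synth[pair]) for pair in train)


# Hypothetical numeric columns; a real run would use the full data sets.
training = {
    "age": [25, 32, 47, 51, 62],
    "education_num": [9, 13, 10, 14, 11],
    "hours_per_week": [40, 45, 38, 50, 20],
}
synthetic = {
    "age": [27, 30, 45, 55, 60],
    "education_num": [10, 13, 9, 14, 11],
    "hours_per_week": [42, 44, 40, 48, 25],
}

gap = max_correlation_gap(training, synthetic)
print(f"Largest correlation gap between training and synthetic: {gap:.3f}")
```

A small gap means the synthetic data preserves the pairwise relationships of the original. Categorical fields such as relationship or marital status would need a different association measure than Pearson correlation.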
Another really handy view that I like uses PCA. PCA is a dimensionality reduction technique, a favorite tool in the arsenal for data scientists. What we're doing here is essentially compressing those 15 columns and those 5,000 rows down into two dimensions that we can look at. What we look for is a similar shape between the synthetic data and the training data, and a similar size and distribution of the different lines, so this looks quite good.
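For a feel of what that projection involves, here is a minimal PCA sketch using NumPy. It fits the two principal components on the training data and projects both data sets onto them, which is one reasonable way to compare shapes but is an assumption about the approach, not a description of how the report does it. The fit_pca and project helpers are hypothetical, and random numbers stand in for the real numeric columns.

```python
import numpy as np


def fit_pca(data, n_components=2):
    """Return the column means and the top principal directions of the data."""
    mean = data.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(data, mean, components):
    """Center the data with the given mean and project it onto the components."""
    return (data - mean) @ components.T


# Random stand-ins for the numeric training and synthetic columns.
rng = np.random.default_rng(seed=0)
scales = np.array([5.0, 3.0, 1.0, 0.5, 0.1])
training = rng.normal(size=(200, 5)) * scales
synthetic = rng.normal(size=(200, 5)) * scales

# Fit on the training data, then project both sets into the same 2-D space.
mean, components = fit_pca(training)
training_2d = project(training, mean, components)
synthetic_2d = project(synthetic, mean, components)

print(training_2d.shape, synthetic_2d.shape)
```

Plotting training_2d and synthetic_2d side by side as scatter plots would give the kind of shape comparison described above. Categorical columns would need to be encoded as numbers first.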
From there, we can also dive into a field-by-field comparison. How many of these values were recreated in the synthetic set versus how many of them existed in the original training set? You want to see the lines as close as possible without things being exactly the same. It looks like the synthetic model once again learned this quite well: we see very similar but different distributions between the synthetic and the original training data.
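One simple way to put a number on a field-by-field comparison like this is total variation distance between the value frequencies of a field in each data set. That metric is swapped in here for illustration and is not necessarily what the report uses; the helper names and sample values are hypothetical.

```python
from collections import Counter


def value_frequencies(values):
    """Map each distinct value to its share of the list."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def total_variation_distance(train_values, synth_values):
    """Half the summed absolute difference in frequencies (0 = identical, 1 = disjoint)."""
    train_freq = value_frequencies(train_values)
    synth_freq = value_frequencies(synth_values)
    all_values = set(train_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(train_freq.get(value, 0.0) - synth_freq.get(value, 0.0))
        for value in all_values
    )


# Hypothetical values for a single categorical field.
train_workclass = ["Private"] * 6 + ["Self-emp"] * 3 + ["State-gov"] * 1
synth_workclass = ["Private"] * 5 + ["Self-emp"] * 3 + ["State-gov"] * 2

distance = total_variation_distance(train_workclass, synth_workclass)
print(f"Total variation distance for workclass: {distance:.2f}")
```

A distance near zero means the synthetic field follows the training distribution closely, which matches the "very similar but different" result described above.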
From there we can use our model, and this last line here shows us a way to have the model generate as many more records as we want. Here we are passing our model a request to generate another 100 records, which it went ahead and did very quickly. We can load that CSV directly from the API service into a DataFrame and use our favorite data science tools to work with it.