Helpful resources to start making your job easier.
How to generate synthetic data with Gretel
In this tutorial, Alex teaches you how to generate synthetic data with Gretel.
Go ahead and log into the console using a Google or GitHub account. From here, we can either download an API key and use the CLI to get started, or we can use the console workflows to generate synthetic data. I find the console workflows are a good way to dip your toe into synthetic data, or to see how a different dataset will perform. You have the option of just dragging in a sample CSV to make things a little easier to visualize; I'm going to go ahead and load a CSV from my own computer.
Let's take a look at this. We'll open it up in Numbers, and you can see some nice mixed categorical and numerical data. This is a reference dataset released by the U.S. Census that's commonly used for machine learning tasks, so it's a good dataset to start with. We're going to see if we can create a synthetic version of this dataset that looks just like it, with the same dimensions, the same insights, and the same look and feel, but where none of the records we're looking at here have actually been duplicated.
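If you'd rather inspect the file in code than in a spreadsheet, a quick pandas pass shows the categorical/numerical split the narration describes. The rows below are a hypothetical sample in the shape of the adult income dataset, not the real file:

```python
import pandas as pd

# A few illustrative rows in the shape of the adult income dataset
# (these values are a made-up sample, not the actual Census records).
df = pd.DataFrame({
    "age": [39, 50, 38],
    "education": ["Bachelors", "Bachelors", "HS-grad"],
    "hours_per_week": [40, 13, 40],
    "income": ["<=50K", "<=50K", "<=50K"],
})

# Separate categorical from numerical columns, as you might do before
# deciding how a synthetic model should treat each field.
categorical = df.select_dtypes(include="object").columns.tolist()
numerical = df.select_dtypes(include="number").columns.tolist()
print(categorical)  # ['education', 'income']
print(numerical)    # ['age', 'hours_per_week']
```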
From here, we'll choose File and load that same adult income data file we were looking at a second ago, then upload it to Gretel. One of the neat things about Gretel is that you can pick where your datasets run. Using the CLI, you can choose to run locally on a worker so that your data never leaves your environment, or you can have the Gretel Cloud compute the synthetic data for you.
We've uploaded our dataset, so let's choose Generate synthetic data and click Continue. Here we can see, as we saw a second ago, about 14,000 records in 15 fields. From here, we can click Continue again, and it's that easy to have Gretel train a model and generate a dataset for us.
What I'm going to do instead is customize this, because I know this is a workflow a lot of data scientists like to have a little control over. There are a ton of different fields that can be experimented with in the Gretel synthetic data model; you can go to our docs, click on Synthetics, and learn all about the different options.
What I'll do here is browse the templates on GitHub, and we'll download and modify our own configuration file. Click Browse, and this takes you right to GitHub, where you can see a set of templates we provide for different synthetic data use cases: working with complex free text, training on tweets, chat records, or other raw text, or training with differential privacy. Each of these is a different configuration you can use.
I'll start with the default and just make a quick modification to it. Here we can see the default configuration settings; let me grab and copy them. Once again, if you're just getting started for the first time, you can simply click Use recommended settings, which applies the defaults. I'm going to open up a terminal, paste the configuration in, and make one change to the settings: I'm going to have it run for 50 epochs.
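The edit might look something like this in the saved `default.yaml`. The field layout follows Gretel's public blueprint templates, but treat the exact schema as an assumption that may differ in your version:

```yaml
# default.yaml — sketch based on Gretel's default synthetics template;
# the exact schema may vary between releases.
schema_version: "1.0"
models:
  - synthetics:
      data_source: __tmp__
      params:
        epochs: 50   # our one change: cap training at 50 epochs
```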
One of the things about Gretel is that it uses early stopping. The number of epochs, which is the number of times the neural network trains over the data before the model is finalized, is set automatically: when the model essentially stops gaining on loss, training is terminated for you. So you should never have to touch this setting; it's just a convenient example to modify.
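To make the early-stopping idea concrete, here's a minimal patience-based sketch. This is not Gretel's actual implementation, just the common pattern: stop once the loss has failed to improve for a set number of epochs:

```python
def should_stop(losses, patience=5, min_delta=1e-4):
    """Illustrative patience-based early stopping (not Gretel's actual
    logic): stop once the loss has failed to improve by at least
    min_delta over the last `patience` epochs."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    recent_best = min(losses[-patience:])
    return recent_best > best_before - min_delta

# Loss plateaus at 0.64, so training would be cut off early.
history = [1.2, 0.9, 0.7, 0.65, 0.64, 0.64, 0.64, 0.64, 0.64]
print(should_stop(history, patience=3))  # True
```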
We'll go back and choose our file now. We just created `default.yaml`, so we'll select that and continue. What's happening now? It's loading the example dataset that we uploaded, and we can see a sample of it. It takes that, along with the training configuration we just uploaded, and we can see all the settings inside: 50 epochs, as we configured, plus other settings you can change based on your use case. We've got a lot of documentation on those as well. For most datasets, though, you should be good to go with one of our default configurations.
What it's doing right now is reaching out to the Kubernetes cluster on Gretel's back end and starting a worker. A big difference we've made between beta 1 and beta 2 is moving from requiring you to run your own model backend, using TensorFlow and a GPU for acceleration, to running this automatically on demand when a training job comes in. We can see the process starting: it's found a worker, it's beginning ML model training, and it's training the model now. This will take a few minutes.
I can now see that the model completed training. It did 49 epochs, or passes over the data. As you can see here, accuracy was increasing and loss was decreasing, both on the training data and on the validation set, which we use to prevent model overfitting. Accuracy ends up around 90%, and loss around 0.40. Anecdotally, a loss below 1 is usually an indication that you have a good model: it has learned your data and will be able to recreate it pretty well.
Next, the model generates a sample set of 5,000 records. We use this to validate how well the model learned the dataset it was trained on, and to create a report that compares the original training data to the synthetic data you generate, making sure the model was able to learn and maintain the different correlations and insights that existed in the original data.
The output of this process is a neural network model that can be used to generate as much data as you want, and we'll walk through doing that as well. Here we can see it generated 5,000 records, 124 of which did not pass validation. In addition to the neural network, which learns the structure of your data, Gretel runs a set of validation checks, essentially making sure that each value it sees in a column makes sense based on what it saw in the input data.
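As a simplified illustration of that kind of per-column check (Gretel's real validators are more sophisticated than this), you could flag synthetic values never seen in a categorical column, or numeric values outside the observed range:

```python
import pandas as pd

def validate_records(train: pd.DataFrame, synth: pd.DataFrame) -> pd.Series:
    """Toy per-column validity check (an illustration, not Gretel's
    validators): a synthetic value passes if it was seen in training
    (categorical) or falls within the observed range (numeric).
    Returns a boolean mask of fully valid synthetic rows."""
    ok = pd.Series(True, index=synth.index)
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            ok &= synth[col].between(train[col].min(), train[col].max())
        else:
            ok &= synth[col].isin(train[col].unique())
    return ok

# Hypothetical mini example: age 95 is outside the training range.
train = pd.DataFrame({"age": [25, 40, 60], "income": ["<=50K", ">50K", ">50K"]})
synth = pd.DataFrame({"age": [30, 95], "income": ["<=50K", ">50K"]})
print(validate_records(train, synth).tolist())  # [True, False]
```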
After that, it uses the 5,000 records we generated to create a synthetic data quality report. This answers the question: how synthetic is my synthetic data, and will it work well for my use cases? Here you can see we have an excellent model. What that means, anecdotally and for many of our customers, is that the model can be used to create synthetic datasets for training a downstream machine learning task, or for sharing with data scientists across your organization: cases where people want not only the size and shape of the data to look accurate, but also the same insights as the original source data.
Here we see a nice high-level overview: you can see the distribution stability, with everything lining up and looking good. If you want to dive in and go a level deeper, you can download the synthetic data report and get a deep dive into the data, which we'll do now.
So, I'll open up the synthetic data report for our model. This is where, from a data science perspective or when running a machine learning pipeline, I can really dive in and ask: does my synthetic dataset have the same types of correlations as the original training data? You can see visually that it appears to have learned and recreated these correlations very well.
Another one I really like to look at is the PCA, the principal component analysis. Essentially, it takes highly dimensional data and compresses it down onto a 2-D plane where you can look at the distribution of elements across both datasets. What you look for are similar sizes and shapes between the two, which would indicate that the model not only learned the data well, but recreated it well, even down to the different types of outliers that exist in the original data.
From here, we can get a real per-field view, for example hours per week. What we're doing is comparing what the synthetic data model outputted against the original data that was observed. Essentially, we're throwing the statistical kitchen sink at the synthetic dataset we created, comparing it to the original dataset, and giving you a high degree of confidence that the model you've created will work well for your use cases: creating mock data, populating a pre-production environment, or sharing with partners.
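One simple example of a per-field statistical check (one of many a quality report might run; the data below is hypothetical) is a two-sample Kolmogorov-Smirnov test comparing a field's real and synthetic distributions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical "hours per week" values from the real and synthetic sets.
real_hours = rng.normal(40, 10, size=1000)
synth_hours = rng.normal(40, 10, size=1000)

# The two-sample KS statistic measures the largest gap between the two
# empirical CDFs; small values mean the distributions closely match.
stat, p_value = stats.ks_2samp(real_hours, synth_hours)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.3f}")
```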
So, our data has been created, and we can go ahead and view those initial records as well. Clicking through takes us to our project view, where we'll look at the model. Under the US adult income dataset, looking at models, we can see that a dataset has been created. I'm going to go ahead and generate another 5,000 records, which will be a valid dataset of its own. You can also download the synthetic dataset that we just created along with the model; it comes preloaded with the 5,000 records.
I'm initiating a new job. Gretel will spin up a worker, identify it, and start creating a dataset using the model we've trained. It looks like the dataset has completed, so we can download it and take a look at our newly created synthetic dataset. We'll open it up in Numbers as well. You can see we have a nice-looking dataset that very closely matches the original data it was trained on.
On the model view, you can have as many models as you want per project, and you can look at the data source that was originally uploaded along with each model. You can also view the model configuration that was used to train it; within here, I can pick the config. You can also invite new members to your project via email address, which lets other people collaborate and use your model to generate data.
When you're done, if you choose to delete your data, you can simply click here, choose "nuke my project," and delete the data all the way through. Hold down the button, and it will go ahead and delete your data.
Generating Synthetic Data for Healthcare & Life Sciences
Enabling faster access to data for medical research with statistically accurate, equitable and private synthetic datasets.
Build a synthetic data pipeline using Gretel and Apache Airflow
During this webinar, we’ll build an ETL pipeline that generates synthetic data from a PostgreSQL database using Gretel’s Synthetic Data APIs and Apache Airflow.