Setting up your Gretel environment (3/4) - Generate synthetic data locally from the CLI
Video description
In part 3 of this video series on setting up your Gretel environment, Alex shows how to generate synthetic data locally from the Gretel CLI.
Transcription
(00:05): Hey, in part three of this video series, we are going to train a synthetic model using the CLI on a local box. This box could be located on premises, it could be your workstation, it could be inside one of the cloud providers, really your choice. One of the neat features of this is that your data never leaves your environment, so it works for everything from compliance-controlled to really sensitive use cases. How this works is by deploying a container, a Docker container, inside your environment that talks to a GPU-accelerated box to train a synthetic model and generate as much synthetic data as you want. The CLI is a really easy way to get started, so why don't we go ahead and jump in.
(00:40): So we're going to go down to CLI examples, create synthetic data, and follow the instructions that it displays here. First thing we're going to do is create our new project, which we're going to call healthcare, and set it as our default project. We are set up and running there. We're going to download a sample data set. I think I've already downloaded this data set here, but we'll go ahead and run it again. Okay, and why don't we go ahead and take a look at our sample data set here, so we're going to run a head command on it. You can see here about 20 different fields, mixed categorical and numeric data, and some kind of interesting encoded data here, like the age ranges, which you might also see in the length of stay.
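The "head" step above can be sketched as a small shell helper. This is only an illustration: the function name peek_csv is made up for this example, the file name in the usage note is an assumption, and the column count is a rough check that assumes the header row contains no quoted commas.

```shell
# Preview a CSV file: print the first few lines, then report how many
# comma-separated fields the header row has.
# peek_csv is a hypothetical helper name, not part of any CLI.
peek_csv() {
  file="$1"
  rows="${2:-5}"   # number of lines to show, default 5
  head -n "$rows" "$file"
  cols=$(head -n 1 "$file" | awk -F',' '{ print NF }')
  echo "columns: $cols"
}
```

For example, `peek_csv healthcare.csv 5` would show the header plus four records, assuming the sample data set was saved under that name.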
(01:30): Our goal is to train a model that will create an arbitrary amount of synthetic data that matches the same kind of size, shape, distribution, and insights of the original data. We're going to make some slight tweaks to this command, which was designed for training a model using Gretel Cloud from the command line, since we're running on a local box here. So let's go ahead and create our own file and edit it. We'll call it train.sh. We can copy in the example code here. We're going to make a few tweaks. So I like to look at this as one line. You can keep the config, you can go to default if you prefer, whatever configuration you like to use, but the really important thing we're going to change here is the runner, from cloud to local, and I prefer to use the default configs, so we're going to adjust that as well.
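A minimal sketch of what train.sh might look like after those tweaks. The exact command is not shown in the transcript, so everything here is an assumption: the flag names (--config, --runner, --in-data, --output) are recalled from the Gretel CLI and should be checked against `gretel models create --help`, and the file names healthcare.csv and model.json are placeholders.

```shell
# Write the training script to train.sh and make it executable.
# The flags and file names inside are assumptions; verify them with
# `gretel models create --help` before running the script.
cat > train.sh <<'EOF'
#!/bin/sh
# Train a synthetic model on this machine rather than in Gretel Cloud,
# using the default synthetics config, and save the model description as JSON.
gretel models create --config synthetics/default --runner local --in-data healthcare.csv --output . > model.json
EOF
chmod +x train.sh
```

The snippet only creates the script. Running `./train.sh` afterwards would start the training, which requires the Gretel CLI and Docker to be installed and configured.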
(02:19): Here you can see the input data we're providing is the CSV that we just downloaded. For output, we're telling it to drop the output file and the artifacts right here, and to save the new model that's created into a JSON file. Go ahead and run that training file. The first time you do this, this step is going to take a while because it needs to download the container from the ... Here we can see the synthetic data model finished training. It generated a data set for us. We also got a quick overview from the model report of the model quality, which we can see was very good. We can download ... Actually, I believe it went ahead and downloaded these different records for us to take a look at.
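Once the generated records are on disk, a quick sanity check is to compare their shape against the original file. The helper below is a sketch and not part of the Gretel CLI: compare_csv_shape is a made-up name, it assumes both files are uncompressed CSVs with a header row and a trailing newline, and it checks structure only, not the model quality that the report covers.

```shell
# Compare the header row and record counts of two CSV files.
# Returns non-zero when the headers differ.
# compare_csv_shape is a hypothetical helper name.
compare_csv_shape() {
  original="$1"
  synthetic="$2"
  # Subtract 1 from each line count to exclude the header row.
  echo "original rows:  $(( $(wc -l < "$original") - 1 ))"
  echo "synthetic rows: $(( $(wc -l < "$synthetic") - 1 ))"
  if [ "$(head -n 1 "$original")" = "$(head -n 1 "$synthetic")" ]; then
    echo "headers match"
  else
    echo "headers differ"
    return 1
  fi
}
```

If the generated file is compressed, decompress it first, then call something like `compare_csv_shape healthcare.csv synthetic.csv`, where both file names are placeholders.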