Setting up your Gretel environment (4/4) - Generate synthetic data locally from the SDK
In part 4 of this video series on setting up your Gretel environment, Alex walks through the steps to generate synthetic data locally from the Gretel SDK.
(00:05): Hey, in this example, we're going to use Gretel's SDK to train a local model on a box that could be located on your workstation or in the cloud, connected to a GPU. First step is to go ahead and log into the box here. The preferred way to interact with these different models is to use Jupyter Notebooks, so we're going to go ahead and install Jupyter. We're going to tunnel through to this instance that sits in the cloud and then we'll use that to train a synthetic model. So, we've got a directory set up with our example input data here.
(00:30): It's the same data we just used in the previous video to train a model using the Gretel CLI. Now we're going to go ahead and set up Jupyter. So, we'll go ahead and run pip install -U jupyter. That's up and running. We need to set up a way to tunnel into this box. So, we're going to exit out and I'm going to use a handy shortcut on a Mac or a Linux box that you can use to automatically create a tunnel into the cloud. So, we'll go ahead and open vi $HOME/.ssh/config.
(01:00): What you can see here is that for my cloud box I've added a special directive, LocalForward. What that's going to do is take port 8888 on the remote box, which is where Jupyter runs by default, and forward it to my local machine, my Mac in this case, on port 8899. So, as soon as I connect now, that tunnel will be in place.
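The forwarding rule described here can be sketched as an SSH config stanza. The small Python helper below renders one as text. The host alias `cloud-box`, the address `203.0.113.10`, and the user name `ubuntu` are hypothetical placeholders; only ports 8888 and 8899 come from the video.

```python
def render_ssh_stanza(alias: str, hostname: str, user: str,
                      local_port: int = 8899, remote_port: int = 8888) -> str:
    """Return an ~/.ssh/config stanza that forwards local_port to the remote Jupyter port."""
    lines = [
        f"Host {alias}",
        f"    HostName {hostname}",
        f"    User {user}",
        # LocalForward <local port> <host:port as seen from the remote box>
        f"    LocalForward {local_port} 127.0.0.1:{remote_port}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All three values below are placeholders, not taken from the video.
    print(render_ssh_stanza("cloud-box", "203.0.113.10", "ubuntu"))
```

With a stanza like this in the config file, `ssh cloud-box` would open the tunnel for as long as the session stays connected.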
(01:30): Great. So, let's go to our synthetic data directory and launch our Jupyter Notebook. I've fired up the notebook here, which by default uses an authentication token that you'll want to grab. This port should now be accessible from my local machine. So, we'll connect at 127.0.0.1, or localhost, on port 8899. It's going to ask for a token, so we'll use the token that was generated when Jupyter first started.
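As a rough sketch of this step, the snippet below checks whether the forwarded port answers locally and builds a notebook URL that carries the token as a query parameter. The function names are made up for illustration, and the token value is a placeholder for whatever Jupyter printed at startup.

```python
import socket
from urllib.parse import urlencode


def notebook_url(token: str, port: int = 8899, host: str = "127.0.0.1") -> str:
    """Build the local URL for the tunneled Jupyter server, including the token."""
    return f"http://{host}:{port}/?{urlencode({'token': token})}"


def port_is_open(port: int = 8899, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if port_is_open():
        # "PASTE_TOKEN_HERE" is a placeholder, not a real token.
        print(notebook_url("PASTE_TOKEN_HERE"))
    else:
        print("Nothing is listening on port 8899; is the SSH session still open?")
```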
(02:00): And we are in. So, the next step here is to upload the example notebook for local synthetics training. There are a couple of ways you could do that. The simplest way, I think, is either going through GitHub or just going to Colab and saving the notebook. So, we'll go ahead and do that here. I'll open up the local synthetics notebook in Colab. We don't want to be running on the Colab service, so we'll go ahead and just download this file and upload it to our Jupyter Notebook server.
(02:30): Great. So, go ahead and connect here. Open up this notebook and we'll take a quick look at what's happening. First, we have some dependencies that we need to install, so I'm going to go ahead and install those. Since we're using a plain Python environment, we probably don't have pandas or PyYAML installed yet. Go ahead and do our package installs.
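One way to see which dependencies a plain Python environment lacks before installing anything is sketched below. The module list is an assumption: pandas is named in the video, while `yaml` and `gretel_client` are guesses at what the notebook imports.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be imported in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # These are import names, which can differ from pip package names
    # (for example, the PyYAML package is imported as "yaml").
    wanted = ["pandas", "yaml", "gretel_client"]
    print("Missing:", missing_modules(wanted))
```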
(03:02): Going through here, we can see what's happening. It's loading from a remote data source. So, it's loading our US adult income data set from S3. You can point this to any URL or file inside of your local environment. It's going to export that to a training CSV. It allows you to make some changes to it if you wish, and there are some basic configuration options that we can modify here as well. It's going to create a project and then train a synthetic model on that project by submitting it to Docker. Once that's done, we'll be able to grab the synthetic data, the model, the report, or any of those types of artifacts from the model.
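The load, adjust, and export step can be sketched with only the standard library, as below. The notebook in the video uses pandas for this, so this is a stand-in rather than the notebook's code; the function name, the sample rows, and the dropped column are made up for illustration.

```python
import csv
import io


def export_training_csv(source, dest_path, drop_columns=()):
    """Copy CSV rows from a file-like source to dest_path, omitting drop_columns.

    Returns the number of data rows written.
    """
    reader = csv.DictReader(source)
    kept = [c for c in (reader.fieldnames or []) if c not in drop_columns]
    count = 0
    with open(dest_path, "w", newline="") as out:
        # extrasaction="ignore" silently skips the columns we chose to drop.
        writer = csv.DictWriter(out, fieldnames=kept, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)
            count += 1
    return count


if __name__ == "__main__":
    # Illustrative rows only; the real data set is loaded from a URL or file.
    sample = io.StringIO("age,education,income\n39,Bachelors,<=50K\n50,Masters,>50K\n")
    print(export_training_csv(sample, "training_data.csv", drop_columns=("education",)))
```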
(03:40): So, now we'll go ahead and just do Restart and Run All and this should run through the entire notebook. Here you can see a preview of the data set. This time we'll be synthesizing the US adult income data set, a classic machine learning data set that's used to predict US adult income based on US census data.
(04:00): Moving down a little bit here, we see it loading our configuration. So, we're loading it from a configuration template that sits in the Gretel cloud, and you can go ahead and take a look at that if you like in the Gretel blueprints repository on GitHub. Here you can see we're making a few configuration changes. We're telling it to train for only 50 epochs, and the name of the data source we want to use is a local file called training_data.csv. I can see it starting up here.
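The two configuration changes mentioned here can be sketched as plain dictionary edits. The nested layout `config["models"][0]["synthetics"]` and the stand-in template values are assumptions about the template's shape, not something shown in the video; only the 50 epochs and the file name training_data.csv come from the source.

```python
import copy


def apply_overrides(config: dict, epochs: int, data_source: str) -> dict:
    """Return a copy of config with the epoch count and data source replaced.

    Assumes the layout config["models"][0]["synthetics"], which is a guess
    at the template's shape rather than a documented contract.
    """
    updated = copy.deepcopy(config)
    synthetics = updated["models"][0]["synthetics"]
    synthetics.setdefault("params", {})["epochs"] = epochs
    synthetics["data_source"] = data_source
    return updated


if __name__ == "__main__":
    # Stand-in for the template the notebook loads; values are illustrative.
    template = {
        "schema_version": "1.0",
        "models": [{"synthetics": {"data_source": "__tmp__",
                                   "params": {"epochs": 100}}}],
    }
    print(apply_overrides(template, epochs=50, data_source="training_data.csv"))
```

Working on a deep copy keeps the loaded template unchanged, which makes it easy to try several settings from the same starting point.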
(04:30): It's validating that the configuration is correct and submitting the model to our local box, which is sitting here in the cloud on GCP, for training. Here, we can see it's instantiated a local worker. So, this is training on a remote box currently. It's going to do roughly 50 passes over the data and it'll use that to generate a new dataset. It ran the privacy filters on it and didn't find a need to remove any data.
(05:00): So, here we see a total of 5,000 records generated. It's saving the model, so we can generate an arbitrary amount of new data, and it's also downloading the 5,000 sample records that we just created. So, here is a first look at our sample synthetic CSV. Here, we can look at the report. This doesn't always render correctly inside a Jupyter Notebook, but you can download the HTML or look at it on your local instance to see everything. From the overview here, we can see that the score was actually very good.
(05:35): And we're going to call the model again, just as an example of how to generate more data with a model that's already been trained, telling it to generate an additional 100 records. It looks like that. It's just firing up the generator right now and will finish in a few seconds. And there we have it: an additional 100 records, and we have trained and generated data on a local box.