Create Synthetic Data with the Gretel CLI
In this tutorial, Alex shows you how to create synthetic data with the Gretel CLI.
Hey, I'm Alex. I'm a co-founder here at Gretel, and today I'm going to walk through using the Gretel CLI and APIs to generate synthetic data. Let's jump on in. The first thing you'll want to do is log into the Gretel console. We use this to generate an API key, which we will connect to the CLI so it knows how to interface with our service. The next step is to install the Gretel CLI with a simple pip command. It requires Python 3 or greater. Once that's set up, you'll want to make sure it's installed correctly: just typing gretel ought to show you an overview of the different commands that you can execute. Running gretel configure will allow us to set up the defaults. Go ahead and accept the defaults; we're going to be training our model in the cloud. You also have the option of training a model locally, where your data never leaves your environment, if you wish. I'm going to go ahead and grab our API key and paste that in here.
And it looks like we are good to go. From there, let's jump over to the docs and follow along with the example for creating synthetic data, under CLI Tutorials, Create Synthetic Data. The first thing we'll do is create a project. We're going to create a default project, which is the landing place for our models and the data we create within the Gretel SaaS service. Take a look at this command here, gretel projects create. We're giving it a display name, healthcare, which is just meant to describe the type of dataset we're working on. We're also setting it as the default, so you don't have to specify it in the next commands. That was created correctly.
Next, we'll grab the data. You can actually just call Gretel directly on this file that's located in the cloud, but I wanted to pull it down and take a look at the data before I started training. So we use wget to pull down this example EHR data that we're going to create a synthetic version of. We downloaded that from S3, so let's take a look. What we see here are about 20 different columns of data, with a nice mix of integer, categorical, free-text, and numeric data, so it's a good challenge for any synthetic data system. Our challenge here is to train a neural network on this text; essentially, we're training a language model that learns to recreate this text. And we look at whether it's capable of learning the insights and correlations that exist in the input data and recreating them in a new, artificial synthetic dataset.
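A minimal sketch of that kind of first look at a mixed-type file, using only the Python standard library. The inline sample and its column names are made up for illustration and are not taken from the real EHR file; `infer_type` and `summarize` are hypothetical helpers, not part of the Gretel CLI.

```python
import csv
import io

# Hypothetical sample rows; the column names are assumptions for
# illustration, not the columns of the real EHR dataset.
SAMPLE_CSV = """case_id,department,stay_days,deposit,notes
1,radiology,12,4911.0,stable on arrival
2,surgery,3,5954.5,follow-up needed
3,radiology,7,4745.0,transferred from ER
"""


def infer_type(values):
    """Classify a column as integer, numeric, categorical, or free text."""
    try:
        [int(v) for v in values]
        return "integer"
    except ValueError:
        pass
    try:
        [float(v) for v in values]
        return "numeric"
    except ValueError:
        pass
    # Repeated values suggest a category; all-unique strings suggest free text.
    return "categorical" if len(set(values)) < len(values) else "free text"


def summarize(csv_text):
    """Map each column name to its inferred type."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {col: infer_type([row[col] for row in rows]) for col in rows[0]}


summary = summarize(SAMPLE_CSV)
for column, kind in summary.items():
    print(f"{column}: {kind}")
```

The uniqueness heuristic is deliberately crude; on a real file with thousands of rows you would likely want a threshold on the ratio of distinct values instead.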
Okay, so we've got that downloaded. The next step is to train a synthetic model on our dataset. Let's take a look at this command. You can go ahead and paste this in. What we're running here is gretel models create, so essentially you're telling it to train a new model within our project. We're running in the cloud, and we're specifying a configuration. You can use the default one; we're using one here called high field count, which works well for this dataset as it has about 20 fields in it. The input data is the hospital EHR data, and the output goes to the local file system. And then we're storing the model ID to a file that we can use in subsequent calls to generate records.
We mentioned here that the high field count config is the one that we're going with. You can browse the different configurations or even make your own here. So your options are to use any one of these configurations as is, make small changes to fine-tune it, or create your own. One of the popular ones we see often with our customers is training with differential privacy. So here you can see an example of a YAML config turning on differential privacy and specifying the default parameters we use to instruct training. From here we're ready to start training, so let's go ahead and kick this off.
It's preparing the model and uploading the data source that we have to our project in Gretel. It's creating the project and queuing model creation. What it's doing now is looking for a worker that will pick up this project. A worker is a GPU-enabled box that will start training our neural network for us within the Gretel cloud. The model has completed training. It went for the full 100 epochs without triggering early stopping, so it was continually improving on the validation set. A quick look here shows that the accuracy is quite good: 0.8 on accuracy and validation loss. Typically, with anything under one, you've got a pretty solid model, so we can expect this to perform pretty well. The next step is that it starts generating records. Here we see each batch and the number of records: 365 generated in the first batch, and five failed validation.
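To make the early stopping remark concrete, here is a rough sketch of a patience-based rule: training halts once the validation loss has gone a set number of epochs without improving. The `patience` and `min_delta` parameters and both loss curves are hypothetical and are not Gretel settings or results from this run.

```python
def stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the 1-based epoch at which training would stop.

    patience and min_delta are hypothetical knobs for illustration.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # stopped early
    return len(val_losses)  # ran the full schedule


# A made-up loss curve that keeps improving uses every epoch.
improving = [2.0 * (0.99 ** i) for i in range(100)]

# A made-up curve that improves for 10 epochs, then plateaus.
plateau = [1.0 - 0.05 * i for i in range(10)] + [0.6] * 90

print(stopping_epoch(improving))
print(stopping_epoch(plateau))
```

A run that reaches the last epoch, like the first curve, is the situation described above: the validation loss never stalled long enough to trigger the rule.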
We train a set of expert validators that look at the training data and make sure that anything created by the neural network matches the semantics and the types of data that we saw in the training data. If a record fails validation, we fail the record, which gives really high confidence in the synthetic data that's been created. Here it completed the 5,000 records we asked it to create and went ahead and generated a report. Let's take a look at that report now and try to get an idea of how good a job it did, or how synthetic my synthetic data is.
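The idea of validating generated records against what was seen in training can be sketched as follows. This is an illustrative stand-in and not Gretel's validator implementation: it assumes numeric fields must stay within the training range and other fields must match a value seen in training. The field names and rows are made up.

```python
def learn_rules(training_rows):
    """Learn one rule per field: a numeric range or a set of seen values."""
    rules = {}
    for field in training_rows[0]:
        values = [row[field] for row in training_rows]
        if all(isinstance(v, (int, float)) for v in values):
            rules[field] = ("range", min(values), max(values))
        else:
            rules[field] = ("choices", set(values))
    return rules


def is_valid(record, rules):
    """Check a generated record against every learned rule."""
    for field, rule in rules.items():
        value = record.get(field)
        if rule[0] == "range":
            if not isinstance(value, (int, float)):
                return False
            if not rule[1] <= value <= rule[2]:
                return False
        elif value not in rule[1]:
            return False
    return True


# Hypothetical training rows and generated candidates.
training = [
    {"department": "radiology", "stay_days": 12},
    {"department": "surgery", "stay_days": 3},
    {"department": "radiology", "stay_days": 7},
]
generated = [
    {"department": "surgery", "stay_days": 5},      # passes
    {"department": "radiology", "stay_days": -4},   # out of range
    {"department": "7", "stay_days": 9},            # unseen category
    {"department": "radiology", "stay_days": "n/a"},  # wrong type
]

rules = learn_rules(training)
kept = [record for record in generated if is_valid(record, rules)]
print(f"{len(kept)} valid, {len(generated) - len(kept)} failed validation")
```

Dropping failed records rather than repairing them mirrors the behavior described above, where a batch reports both a generated count and a failed count.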
Let's open up the report. Here we can see our model got an excellent quality score. What this report really allows us to do is deep dive on the data: see how many of the original training lines were duplicated, for example, get some idea around privacy, and get insight into the distributions of the original data. I typically like to start by looking at correlations in the data, and here we can see that our neural network appears to have learned and recreated the correlations in the original dataset very well. If this is something you're not seeing enough of, I would recommend adding additional records, which helps quite a bit. Sometimes increasing the complexity of the neural network will help as well.
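Two of the checks mentioned here, duplicated training lines and preserved correlations, can be approximated by hand. The sketch below is an assumption-laden illustration, not how the Gretel report computes its scores: it counts exact row matches and compares a single Pearson correlation between two made-up numeric columns.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def duplicated_rows(training, synthetic):
    """Count synthetic rows that exactly repeat a training row."""
    seen = set(training)
    return sum(1 for row in synthetic if row in seen)


def column(rows, index):
    return [row[index] for row in rows]


# Hypothetical (stay_days, deposit) rows, not real report data.
training = [(1, 10.0), (2, 19.0), (3, 31.0), (4, 42.0), (5, 48.0)]
synthetic = [(1, 12.0), (2, 19.0), (3, 28.0), (4, 39.0), (6, 61.0)]

train_corr = pearson(column(training, 0), column(training, 1))
synth_corr = pearson(column(synthetic, 0), column(synthetic, 1))
corr_gap = abs(train_corr - synth_corr)
dupes = duplicated_rows(training, synthetic)

print(f"training r={train_corr:.3f}, synthetic r={synth_corr:.3f}")
print(f"correlation gap={corr_gap:.3f}, duplicated rows={dupes}")
```

A small gap suggests the relationship between the two columns survived generation, while a high duplicate count would be a privacy warning sign.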
The second graph I really like to look at is the PCA. What PCA does is compress really high-dimensional data, like the 20 different columns we have, into a two-dimensional plane. What I like to do is just compare the shapes of the two sets here and make sure that they're similar. This shows that we didn't overfit and focus on learning just a few modes; rather, we would like to see a consistent distribution across both datasets. From there you have the ability to dive in at a field-by-field level. Look at the examples here, available rooms in the hospital or admission deposit. What we're looking at here is the distribution of elements we're seeing in the synthetic dataset versus the training set that it was based upon. So this really gives you a high-confidence view, before you start using your model to generate large amounts of data, of the quality of that model and how well it would work for your use case, whether you're doing downstream ML or seeking to balance a dataset or create a pre-production environment. If you're happy with that, let's go ahead and take a look at the data preview we generated.
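The field-by-field comparison can be reduced to a single number per field. One simple choice, swapped in here for illustration and not necessarily what the report uses, is the total variation distance between the two value distributions: 0 means identical proportions and 1 means no overlap. The room counts below are made up.

```python
from collections import Counter


def proportions(values):
    """Map each distinct value to its share of the column."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def total_variation(train_values, synth_values):
    """Total variation distance between two empirical distributions."""
    p = proportions(train_values)
    q = proportions(synth_values)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical "available rooms" values for each dataset.
train_rooms = [2, 2, 3, 3, 3, 4, 4, 2, 3, 4]
synth_rooms = [2, 3, 3, 3, 4, 4, 2, 3, 4, 4]

distance = total_variation(train_rooms, synth_rooms)
print(f"total variation distance: {distance:.2f}")
```

For a continuous field such as a deposit amount, you would bin the values first, since exact matches between two sets of floats are rare.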
From here we see our data preview gzip file. What we have here looks like a really nice synthetic dataset. That's the 5,000 records we generated. Next we'll go back to our documentation and generate some more data. Now we'll use the gretel records generate command, and we'll point it at that model_data.json that we created earlier. So look at this command: this time we're using gretel records generate, and for the model ID we're passing it the file that we stored the model ID to. Running it in the cloud, it's going to generate 5,000 records. You're going to see the task being sent to the cloud and a cloud worker being identified. And there you have it: we've generated an additional 5,000 records. The model can generate an unlimited amount of data. For next steps after going through this tutorial, I would suggest fine-tuning the configuration that we used earlier to see how you can get the maximum synthetic data quality score for your use case.