Deploying Gretel Hybrid on AWS with Terraform
Gain hands-on experience with Gretel Hybrid on AWS, from setup to cleanup, for deploying in your environment.
Speaker 1 (00:00):
Hey everyone, Ben here with Gretel.ai. I'm here to show you how quickly and easily you can deploy Gretel Hybrid into your own cloud environment. Gretel Hybrid allows our customers to generate safe, privacy-guaranteed synthetic data within the boundaries of their own cloud tenant. This will allow you to iterate and test Gretel Hybrid against your own sensitive data sets without sending any data to Gretel Cloud APIs. We're going to be running through this demo with an AWS environment, but we also support Azure and GCP. Let's go ahead and jump in and get started. First, I'm going to make sure that our prerequisite CLI tools are installed. We need the AWS CLI. We need the Terraform CLI, and I'm going to run this help command to confirm that's installed correctly. And we need the Gretel CLI, which I'm also going to verify with this help flag. Great, all three CLIs are installed and working. I'm going to go ahead and clear my screen. Next, let's take a look at AWS authentication.
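The prerequisite check described above can be sketched as a small shell helper. This is a minimal sketch, not the exact commands from the recording: the function name `check_cli` is hypothetical, and it only tests that each tool is on the `PATH`, which is a lighter check than running each tool's help command as the demo does.

```shell
# Hypothetical helper: succeed if the named CLI is found on the PATH.
check_cli() {
  command -v "$1" >/dev/null 2>&1
}

# The three CLIs named in the walkthrough.
for tool in aws terraform gretel; do
  if check_cli "$tool"; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
  fi
done
```

Any tool reported as missing would need to be installed before continuing.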
We want to set up appropriate AWS authentication for the AWS account we'll be using. I went ahead and took care of that already before this recording started. You can follow along with the AWS documentation or with our blog article, which will walk you through how to do that. We can run the aws sts get-caller-identity command to confirm that we are authenticated and connected to the right account. I went ahead and did that before this recording, so I'm going to go ahead and move on to our next step. The next step is to clone the Gretel Hybrid Git repository, which is publicly available, to our current working directory. You can see we did so successfully. I'm going to go ahead and change directories into the Gretel Hybrid repository, specifically the AWS Terraform directory, where we have a full deployment example that we're going to be following along with.
We can see there are three files here. main.tf is where our Terraform resources and modules are defined and referenced. variables.tf is where our variables are specified, how we expect them to be input. And terraform.tfvars is where we actually define those variables and pass them in to the Terraform module. Let's take a closer look at terraform.tfvars and we can see what variables are defined. You can see we have things like region, deployment name, and Kubernetes version here. These defaults should be just fine for us. If you want to update your AWS region or change the deployment name, which will be used to name most of the AWS resources, you can do that. One callout specifically is that we do need to update the source bucket name and the sink bucket name, because AWS S3 buckets must be uniquely named globally, even across AWS accounts.
So I'm going to go ahead and paste in two new variable definitions that I think should be globally unique. These are similar to what we had, but I've prefixed them with my first name. So we have Ben Gretel source bucket and Ben Gretel sink bucket. The source bucket will be used for staging any data that we want to train a model against, and the sink bucket will be used for any output synthetic data or other job artifacts produced as part of training or running a model. Finally, we have the cluster admin roles and cluster admin users. If you have an IAM role or an IAM user that you would like to grant access to the Kubernetes cluster via the Kubernetes API, you could follow the examples here and list them out. The IAM role or user that you use to deploy these Terraform modules while you're following along will already have access to the cluster by default, so I'm going to go ahead and leave these empty and we'll go ahead and move on.
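Because the bucket names have to follow S3 naming rules as well as being globally unique, a quick local check before editing terraform.tfvars can save a failed apply. The sketch below is illustrative only: the `yourname` prefix and the helper `is_valid_bucket_name` are hypothetical, the hyphenated spelling of the bucket names is an assumption based on how they are spoken in the demo, and the pattern covers only the basic rules (3 to 63 characters; lowercase letters, digits, hyphens, and dots; starting and ending with a letter or digit). It cannot tell you whether a name is already taken by another account.

```shell
# Hypothetical helper: basic S3 bucket-name syntax check (not a uniqueness check).
is_valid_bucket_name() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$'
}

# Placeholder names; replace "yourname" with a prefix unique to you.
for name in "yourname-gretel-source-bucket" "yourname-gretel-sink-bucket"; do
  if is_valid_bucket_name "$name"; then
    echo "ok:      $name"
  else
    echo "invalid: $name"
  fi
done
```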
Let's save our changes. We do have one other change to make, in main.tf. I'm going to go ahead and turn on line numbers here, and you can see on line 10 that we have the AWS provider commented out. I'm going to go ahead and uncomment the AWS provider so that Terraform will pull in the AWS authentication we've already set up, the same way that the AWS CLI does. So I'm going to uncomment lines 10, 11 and 16 and leave the rest commented out there for a basic AWS provider definition, and I will save the changes to main.tf. We have one more input variable that we need to pass into our Terraform module, and this is our Gretel API key. The Gretel API key is going to be passed in to the Kubernetes deployment so that the Gretel Hybrid agent controller can use that API key to connect to the Gretel API.
We can pass in a sensitive value like this using an environment variable rather than defining it in a clear-text file like the terraform.tfvars file. I'm going to paste in an example here. We need the TF_VAR prefix for Terraform to recognize it, and then after this second underscore, the rest of the environment variable name just needs to match the variable name as it's defined in variables.tf. So TF_VAR_gretel_api_key is what we need to utilize, and then you can paste in your Gretel API key here. I'm going to go ahead and pause the video while I paste in my real API key.
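As a minimal sketch of that step, assuming the variable is named `gretel_api_key` in variables.tf as spoken in the demo, the export looks like the following. The value shown is a placeholder, not a real key.

```shell
# Terraform reads an environment variable named TF_VAR_<name> as the value
# of the input variable <name>. Replace the placeholder with your own key.
export TF_VAR_gretel_api_key="<your-gretel-api-key>"

# Confirm the variable is set without printing the secret itself.
if [ -n "${TF_VAR_gretel_api_key}" ]; then
  echo "TF_VAR_gretel_api_key is set"
fi
```

Setting the key this way keeps it out of terraform.tfvars, so it is less likely to end up committed to version control.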
Now that my real API key has been pasted in and set in that environment variable, I went ahead and cleared my screen, and we're going to move on to run our Terraform commands to deploy these resources. First I'm going to run terraform init, which is going to initialize the Terraform modules that are dependencies for the resources we're deploying. So it's pulling some of these modules down from the upstream Terraform registry, and that finished successfully. So now we can run a terraform apply and see how this goes. The first thing it's going to do is look at our resource definitions and query the AWS API to see whether they're already deployed. It's going to see that none of these resources have been deployed yet. Therefore, this plan has 106 resources to add and zero to change or destroy, because there were no existing resources. Since this is a greenfield deployment, I'm going to go ahead and say yes, we would like to perform these actions, and we will kick our deployment off. It's going to start deploying all of the resources defined in these modules, which is going to take about 12 minutes. So I'm going to pause the video now, let these resources deploy, and we will proceed as soon as it is finished.
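The two Terraform commands from this step can be sketched as one small wrapper. The function name `deploy_gretel_hybrid` is hypothetical; the commands themselves are the standard `terraform init` and `terraform apply`, and the function assumes it is called from the directory that holds main.tf and terraform.tfvars.

```shell
# Hypothetical wrapper around the two Terraform commands used in the demo.
deploy_gretel_hybrid() {
  # Download the modules and providers referenced by main.tf.
  terraform init || return 1
  # Show the plan, then prompt for "yes" before creating anything.
  terraform apply
}
```

Calling `deploy_gretel_hybrid` stops early if initialization fails, so `terraform apply` never runs against a half-initialized working directory.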
Okay, our deployment has finished. You can see that we have 106 resources successfully added, zero changed and zero destroyed. So all of our AWS infrastructure has been deployed for Gretel Hybrid. Let's go ahead and move on and run a Gretel job. I am going to export some environment variables to make things easier. We have the source bucket and sink bucket names, and these match exactly the names that we passed in in the terraform.tfvars file. I'm going to set these environment variables in a second terminal tab as well so that we can do a little multitasking. Let's go ahead and configure the Gretel CLI. The first option here, for endpoint, defaults to api.gretel.cloud, which is exactly what we need. We can hit enter to select the default. Next, we want to set the artifact endpoint, which is going to be the S3 URI for that sink bucket. I already have the correct value in here from some testing earlier, but I'm going to pass this in again. We want s3 colon slash slash, and then we want the name of the sink bucket.
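A rough sketch of those environment variables and the resulting artifact endpoint follows. The variable names `SOURCE_BUCKET`, `SINK_BUCKET`, and `ARTIFACT_ENDPOINT` and the bucket names are assumptions for illustration; the only requirement stated in the demo is that the bucket names match terraform.tfvars and that the artifact endpoint is the S3 URI of the sink bucket.

```shell
# Placeholder names; these must match the bucket names in terraform.tfvars.
export SOURCE_BUCKET="yourname-gretel-source-bucket"
export SINK_BUCKET="yourname-gretel-sink-bucket"

# The artifact endpoint entered at the Gretel CLI configuration prompt
# is the S3 URI of the sink bucket.
ARTIFACT_ENDPOINT="s3://${SINK_BUCKET}"
echo "Artifact endpoint: ${ARTIFACT_ENDPOINT}"
```

With these set, `gretel configure` walks through the prompts described here, and the echoed value is what goes in at the artifact endpoint prompt.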
The default runner is going to be cloud by default. We want that to be hybrid, so I will specify that explicitly. And then finally we have the Gretel API key. This is going to be the same API key we passed in as an environment variable earlier. You can go ahead and paste that in here. For default project, mine is already set, but if this is empty, that is okay too. We're just going to hit enter to select the default. I'm going to go ahead and correct a typo I caught in my bucket definition. The sink bucket was misnamed, and we're going to move on. We need some sample data, so I'm going to download a sample healthcare dataset. In this command here, we're just setting the URL environment variable and curling the file. So we have downloaded this CSV locally. You can see it here, sample synthetic healthcare CSV. I'm going to paste in this command that uses the AWS CLI to copy this file to our source bucket, and the file's been uploaded successfully. So let's move on to training a model with Gretel.
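The download-and-upload step might look roughly like the helper below. The function name `stage_dataset` is hypothetical, and the dataset URL is deliberately left as a parameter because the transcript does not state it. The `curl` and `aws s3 cp` invocations are standard.

```shell
# Hypothetical helper: download a CSV and copy it to the source bucket.
#   $1 = URL of the sample dataset (not given in the transcript)
#   $2 = name of the source bucket
stage_dataset() {
  url="$1"
  bucket="$2"
  file="$(basename "$url")"
  # -f fails on HTTP errors, -sS stays quiet but shows errors, -L follows redirects.
  curl -fsSL -o "$file" "$url" || return 1
  aws s3 cp "$file" "s3://${bucket}/${file}"
}
```

A call would take the form `stage_dataset "$URL" "$SOURCE_BUCKET"`, where both variables are whatever you exported earlier.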
First, let's go ahead and set a Gretel project name. I'm going to use an environment variable for this. We are naming the project Gretel hybrid deployment test and then just passing in a random suffix so that we ensure there are no conflicts. I'm going to echo the environment variable, and you can see that it is Gretel hybrid deployment test 5560. So we're going to use the Gretel CLI and create this project. You can see the console link is also output as a result of this command. So you can open that up in the Gretel console, and you'll see an empty project with no models associated with it yet.
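One way to build a project name with a random suffix is sketched below. The hyphenated spelling `gretel-hybrid-deployment-test` is an assumption based on the spoken name, the suffix here is read from `/dev/urandom` rather than produced however the demo does it, and `create_project` is a hypothetical wrapper. The `--name` flag reflects my understanding of the Gretel CLI and should be confirmed with `gretel projects create --help`.

```shell
# Random numeric suffix (0-65535) so the project name is unlikely to collide.
SUFFIX="$(od -An -N2 -tu2 /dev/urandom | tr -d ' ')"
export GRETEL_PROJECT="gretel-hybrid-deployment-test-${SUFFIX}"
echo "$GRETEL_PROJECT"

# Hypothetical wrapper; the --name flag is assumed, check the CLI help.
create_project() {
  gretel projects create --name "$GRETEL_PROJECT"
}
```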
Let's go ahead and create a model, which will start a training job within our hybrid Kubernetes cluster running in AWS. I'm going to paste this command in. This Gretel CLI command will create a model. We're going to use the Gretel ACTGAN model, which is our default recommendation. The in-data will be that CSV file that we uploaded to our source bucket. The runner here is shown as manual. You could also pass in hybrid. Both of those are aliases for hybrid runners. And the project is being pulled in from the environment variable that we just set.
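The model-creation command described above might be wrapped like this. It is a sketch under assumptions: `train_model` is a hypothetical name, and the config template `synthetics/tabular-actgan` and the flag spellings are my best understanding of the Gretel CLI rather than something the transcript shows, so verify them with `gretel models create --help`.

```shell
# Hypothetical wrapper around the model-creation command from the demo.
#   $1 = S3 URI of the training CSV in the source bucket
#   $2 = Gretel project name
train_model() {
  # The config template name is an assumption for the ACTGAN model.
  # "manual" is one of the two runner aliases mentioned in the demo.
  gretel models create \
    --config synthetics/tabular-actgan \
    --in-data "$1" \
    --runner manual \
    --project "$2"
}
```

A call would look like `train_model "s3://$SOURCE_BUCKET/<file>.csv" "$GRETEL_PROJECT"`, with the file name matching whatever was uploaded to the source bucket.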
You can see that the job has been created and the model creation has been queued. Now at this point I could pause the video and we could pick back up when the job is finished, but I am quickly just going to show you what things look like from the Kubernetes side. We can see that the job has shifted to the pending status and a worker is being allocated to begin model creation. I am going to clear my screen and run the aws eks update-kubeconfig command. I will pass in the region and the cluster name, and this will configure our kubectl CLI so that we can access the Kubernetes cluster. You don't need to have the kubectl CLI installed to follow along. I'm just showing you what these resources look like from the Kubernetes side while we wait for the job to finish running. I'm going to use k9s, which is a CLI tool for managing Kubernetes clusters. And when I open k9s you can see that we are defaulting to the Gretel Hybrid namespace. Let me go ahead and switch to our pod view here.
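For readers without k9s, roughly the same view is available through kubectl. The sketch below swaps k9s for `kubectl get pods`; the helper name `show_hybrid_pods` is hypothetical, and the default namespace `gretel-hybrid` is an assumed spelling of the namespace named in the demo.

```shell
# Hypothetical helper: point kubectl at the EKS cluster, then list the pods.
#   $1 = AWS region
#   $2 = EKS cluster name
#   $3 = namespace (optional; "gretel-hybrid" is an assumed default)
show_hybrid_pods() {
  aws eks update-kubeconfig --region "$1" --name "$2" || return 1
  kubectl get pods --namespace "${3:-gretel-hybrid}"
}
```

The region and cluster name passed in should be the ones used for the Terraform deployment.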
Okay, so in our pod view, you can see the Gretel agent is in the running status here in this middle column, and you can see there's another pod with a unique ID in its name, and it's in the ContainerCreating status. This Kubernetes pod corresponds to the job that we just scheduled. ContainerCreating means that the Kubernetes node that the pod is scheduled on is pulling the container image. So we're going to go ahead and pause for just a moment and let that image finish pulling and let that job start running. Alright, just a couple of minutes later and our job is running. Now, the highlight here is that this is your EKS cluster deployed within your AWS environment. So this is compute that you are providing and managing, deployed by these Terraform modules. And the source bucket where we staged our dataset is also sitting within your AWS environment, meaning this container here is pulling that data directly from the source bucket.
We're training the model, and then it will output the model artifacts and the synthetic data to your sink bucket, and your data is never leaving your environment. We can take a closer look at the container logs here and we will see that training is in progress. We can also look back over at our Gretel CLI and we can see similar updates as training progresses. I'm going to go ahead and exit out of our Kubernetes view. You've had a glimpse into how a pod is scheduled as a result of our hybrid job. And after we do this, I'm going to go ahead and pause the video while we wait for training to finish. Okay, just a couple of minutes later and our training job has finished. The model has been trained and we're ready to move forward with the rest of our demo. When a model is trained, it generates 5,000 sample records using the model, just to give you an idea of what the synthetic data will look like. So we're just going to take a look at some of those records, compare them to the original dataset, and then we will finish up. The first thing we need to do is just grab a little bit of extra information about the model ID that was just generated. I am going to set these environment variables. We need to grab our Gretel project from our other shell here. So I'm going to move back over here and export the Gretel project.
And now I'm going to set these other two shell variables, which are just using the Gretel CLI to pull the information about our model that was just trained, and we can use those to take a look at the artifacts in our sink bucket. So here you can see the path in our sink bucket is just the Gretel project ID, slash model, slash the Gretel model ID, and we have some artifacts from that model training job. We have the data preview, which I mentioned is 5,000 example records of synthetic data generated from that model. We also have the logs if we wanted to take a look at those. The report HTML is our synthetic quality report. And finally, we have the model itself, the output artifact from that model training job. This is sitting within your sink bucket, or your artifact bucket, and is not passed to Gretel. So we're passing in the base ACTGAN model that you utilize, you train it on your sensitive data, and that trained model never leaves your environment.
Let's go ahead and grab the data preview and the report HTML. We'll pull them down locally, and that way we can take a closer look at the data, and as you follow along, you can take a look at the HTML report as well. So using the AWS CLI, I'm going to go ahead and grab the sample data that we just saw in that list of artifacts. I'm just using the aws s3 cp command, and we copied data_preview.gz, a gzipped archive, over locally, and we added the CSV file extension to it because that is the type of data present there. I'm going to use the gunzip utility to go ahead and extract that. And we can see we now have the data preview CSV here. I can do the same with the report HTML. It is compressed, so I am going to copy it locally and uncompress it using gunzip.
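The copy-and-extract step for the data preview can be sketched as follows. The helper name `fetch_data_preview` is hypothetical, the artifact file name `data_preview.gz` is taken from how it is spoken in the demo, and the bucket path follows the layout described earlier (project ID, then `model`, then model ID). How the project ID and model ID are looked up is not shown here.

```shell
# Hypothetical helper: download and decompress the preview of synthetic records.
#   $1 = sink bucket name
#   $2 = Gretel project ID
#   $3 = Gretel model ID
fetch_data_preview() {
  prefix="s3://$1/$2/model/$3"
  # Add the .csv extension locally so the extracted file is named as a CSV.
  aws s3 cp "${prefix}/data_preview.gz" ./data_preview.csv.gz || return 1
  # gunzip replaces the archive with ./data_preview.csv.
  gunzip -f ./data_preview.csv.gz
}
```

After it runs, `head -n 10 data_preview.csv` shows the first rows for a side-by-side look with the original dataset.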
Let's take a look at the sample synthetic healthcare data first. I'm going to go ahead and clear my screen and we can look at the CSV. I'm just looking at the first 10 rows, and you can see the column names here. So the first column is case ID, the second column is hospital code, et cetera. Let's take a look here. We can see that the IDs are in ascending order and they're unique. Out of the English readable words, we see some like radiotherapy, anesthesia, emergency, trauma and extreme. And so if we take a look at the column headers there, that's going to be the department up here, the type of admission, and the severity of illness. So let's go ahead and take a look at the data preview, and we can see we have unique case IDs here. We have this department, and we have type of admission and severity of illness as well. So the model has trained on the data that's present, it's keeping its statistical properties, but it is privacy-guaranteed synthetic data output by this model.
Finally, I've opened up the synthetics report just to give you a quick look over here and you can see the reporting that we offer on the synthetic model. It provides data summary statistics and a report on quality score, privacy protection summary, et cetera. So you can dive in and take a closer look as you follow along. I'm going to go ahead and proceed with the rest of the demo, which is just cleaning up after ourselves. So now before we destroy all our resources, this would be a great time if you wanted to take a closer hands-on look at Gretel and run it against one of your sensitive data sets. You could do so within your hybrid environment if that is a CSV file like we're already working with. Or if you'd like to take a closer look at the Gretel relational documentation and connect to a relational data source, you're also able to do that. Once you're done taking a closer look at Gretel, we can clean up our cloud resources to save on costs.
So to destroy our resources, Terraform makes it extremely simple. The first thing we want to do is clean up our S3 buckets, because bucket deletion will fail if the buckets have objects present in them. So I'm going to go ahead and use the AWS CLI to clean these buckets up. We're going to use the aws s3 rm command with the recursive flag to delete all objects present in both buckets. Now that that's done, all we have to do is a terraform destroy from the same directory that we ran the terraform apply in, the full deployment working directory. It's going to assess the current state of all deployed resources and then it's going to give us the plan. Once again, since we're running the destroy command, we can see we have zero resources to add or change and 106 resources to destroy. I'm going to go ahead and say yes, and we will let the resources be cleaned up automatically by Terraform.
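The cleanup order described above matters, so it can help to capture it in one place. The wrapper name `cleanup_gretel_hybrid` is hypothetical; `aws s3 rm --recursive` and `terraform destroy` are the commands named in the demo, and the function assumes it is run from the same directory as the original apply.

```shell
# Hypothetical cleanup helper. Empty the buckets first, because bucket
# deletion fails while objects remain in them.
#   $1 = source bucket name
#   $2 = sink bucket name
cleanup_gretel_hybrid() {
  aws s3 rm "s3://$1" --recursive || return 1
  aws s3 rm "s3://$2" --recursive || return 1
  # Prompts for "yes" before destroying anything.
  terraform destroy
}
```

If either bucket cannot be emptied, the function returns before `terraform destroy` runs, which avoids a partially failed teardown.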
That concludes everything for this demo. In summary, we have walked through deploying Gretel Hybrid onto AWS using Terraform. We did so very quickly, with minimal management overhead, by defining our variables in a plain text file and utilizing Terraform with our configured AWS credentials to deploy and automatically create all the resources. The hybrid Terraform modules that Gretel provides are a great way to get hands-on in a testing environment like this, but they're also built with production environments in mind and can be utilized to operationalize Gretel as you move forward with adopting it and deploying it into your own cloud. So thank you for following along, and happy coding.