Generate Synthetic Data Using Gretel Hybrid

A comprehensive guide on deploying Gretel Hybrid in an AWS environment to generate high-quality synthetic data within 30 minutes.

One common characteristic of production data, regardless of industry or sphere of business, is this: data is sensitive. Customer names, addresses, payment information, and other personally identifiable information (PII) are stored in production databases with strict security (and often regulatory) requirements. Gretel lets customers keep sensitive data within the boundaries of their own cloud tenant through Gretel Hybrid, which has been live and in use by customers for several months now. Customers can train models and generate privacy-guaranteed synthetic data on infrastructure they own and maintain, and their data never leaves their environment.

Kubernetes is the core compute service powering Gretel Hybrid. We support a “bring your own compute” approach across the three major cloud providers (AWS, Azure, and GCP) via their managed Kubernetes services (EKS, AKS, and GKE). Terraform is the cloud-agnostic infrastructure-as-code tool we’ve chosen to enable rapid hybrid deployments and feature delivery. Gretel recently released Terraform modules that empower customers to deploy Gretel Hybrid faster than ever. Our publicly available modules allow a user to deploy a “sandbox” or “proof of concept” environment in less than 30 minutes. They are also production ready and can be used by customers who are ready to fully operationalize Gretel Hybrid.

In this post, we will walk you through the steps to deploy Gretel Hybrid into your own AWS environment and generate some synthetic health data using a demo dataset. This will allow you to iterate and test Gretel Hybrid against your own sensitive datasets without sending any data to Gretel Cloud. Stick with us and follow along!

Want to know more about Gretel Hybrid before testing in your own environment? Check out our Gretel Hybrid documentation here! We have Gretel Hybrid deployment instructions for Azure and GCP. We detail exactly what metadata is sent to the Gretel API from your environment, and we also provide a detailed architecture diagram.

Resource Overview 🔍

Here is the dependency diagram showing our available Terraform modules for Gretel Hybrid on AWS. We will be deploying all the below resources as part of this example, but you’re also able to utilize the individual modules in case you already have an existing VPC or EKS Cluster.

Figure 1. A layout of the included Terraform modules for Gretel Hybrid on AWS.
  • network - Deploys the preliminary VPC which will be used by the Gretel Hybrid EKS Cluster.
  • cluster - Deploys the EKS cluster and a default node group to run kube-system resources.
  • cluster_addon_nvidia_driver - Deploys the Nvidia Device Driver helm chart for GPU nodes. This is required for Gretel Hybrid jobs to utilize the GPU attached to GPU-based instance types.
  • cluster_addon_cluster_autoscaler - Deploys the cluster-autoscaler helm chart for scaling node groups based on Kubernetes workloads. This allows us to minimize compute costs when Gretel Jobs aren’t running.
  • node_groups - Deploys two EKS node groups: one for Gretel CPU workloads and one for Gretel GPU workloads. You can see which Gretel Models require GPUs here.
  • cluster_auth - Manages the aws-auth ConfigMap for the EKS cluster.
  • gretel_hybrid - Deploys the two necessary Gretel Hybrid S3 Buckets as well as the Gretel Hybrid helm chart.

AWS costs a concern? Any EC2 instances created by these modules will autoscale, and the associated autoscaling groups are configured with a minimum size of zero. This means instances will only be created and running when pending Gretel Hybrid jobs exist. The total cost to deploy this infrastructure, run a Gretel Hybrid job, grab a coffee while the job is running, and delete the infrastructure afterward should be under $1! Minus the cost of your coffee, of course. ☕

Walkthrough

As a prerequisite for following along, you will need an AWS account with the proper permissions. Your AWS IAM user or IAM role will need to be able to create and manage all of the resources laid out in the previous section’s diagram. We recommend using an AWS sandbox account with admin-level access for this guide.

1. Install Prerequisite CLI Tools ⌨️

First, we need to install the AWS CLI, the Terraform CLI, and the Gretel CLI. Here’s how:

AWS

Official installation instructions for Linux, macOS, and Windows are available here. Confirm the CLI has been installed correctly by running the following command from your terminal.

aws --version

Terraform

Official installation instructions for Linux, macOS, and Windows are available here. Confirm the CLI has been installed correctly by running the following command from your terminal.

terraform -help

Gretel

The most common way to install the Gretel CLI is using pip. You may run the below command to quickly install it if you already have pip installed. If you need help installing pip, please consult the official pip documentation here.

# Install the Gretel CLI
pip install -U gretel-client

# Confirm the installation was successful
gretel --help

curl and jq

We make use of curl to download a CSV file with some demo data to your local machine. We utilize jq to filter the JSON output from some Gretel CLI commands. If your operating system does not come with these utilities installed by default, you will need to install curl and jq.
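For reference, here is one quick way to install both on common platforms (a minimal sketch assuming apt or Homebrew; substitute your own package manager as needed).

# Debian/Ubuntu
sudo apt-get update && sudo apt-get install -y curl jq

# macOS with Homebrew
brew install curl jq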

2. Deploy Gretel Hybrid on AWS ⚡

Now we’re ready to deploy Gretel Hybrid on AWS. 

Authenticate with AWS

From your terminal, start by configuring your AWS CLI to use the appropriate credentials for your sandbox account. You may follow the official documentation here to authenticate your shell session appropriately. Your organization may have specific guidance on how to access an organization owned account. You can confirm proper AWS access by running the below command and verifying its output.

aws sts get-caller-identity
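If you have not set up credentials yet, two common approaches are shown below as a rough sketch. The values are placeholders rather than real credentials, and your organization may prefer SSO or another mechanism entirely.

# Option 1: interactive setup of an access key, default region, and output format
aws configure

# Option 2: export temporary credentials directly into your shell session (placeholder values)
export AWS_ACCESS_KEY_ID=AKIAEXAMPLE
export AWS_SECRET_ACCESS_KEY=examplesecretkey
export AWS_SESSION_TOKEN=exampletoken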

Clone the Gretel Hybrid Repository

Clone the Gretel Hybrid git repository with the following command.

git clone https://github.com/gretelai/gretel-hybrid.git

Change directory into the full-deployment example.

cd gretel-hybrid/terraform/aws/examples/full_deployment

Here is what our working directory looks like.

full_deployment
├── main.tf
├── variables.tf
└── terraform.tfvars

Configure Terraform Variables and Provider

Open the terraform.tfvars file and edit the list of variables as desired. You must customize the “gretel_source_bucket_name” and “gretel_sink_bucket_name” variables since AWS S3 bucket names must be globally unique.

region                    = "us-west-2"
deployment_name           = "gretel-hybrid-env"
kubernetes_version        = "1.27"
gretel_source_bucket_name = "unique-gretel-source-bucket-name"
gretel_sink_bucket_name   = "unique-gretel-sink-bucket-name"
# Provide any IAM users or roles which should be allowed to run "aws eks update-kubeconfig" to gain access to the cluster
cluster_admin_roles = {
  # Format: "alias" = "IAM role ARN"
  # Example: "adminrole" = "arn:aws:iam::012345678912:role/cloud_team_admin_role"
}
cluster_admin_users = {
  # Format: "alias" = "IAM user ARN"
  # Example: "poweruser" = "arn:aws:iam::012345678912:user/cloud_team_admin_user"
}

After customizing the bucket names and any other desired variables, you may move on.

Edit the main.tf file and add the following on line 10 to define a blank AWS provider for Terraform, which will cause Terraform to use the default AWS credentials you have already configured.

provider "aws" {}

Setup Gretel API Key

If you haven’t already, sign up for a free Gretel account at https://console.gretel.ai/. 

You may directly retrieve your API Key after signing up by visiting this link and copying the key.

We must pass the API Key to Terraform for deployment. We don’t want to define a sensitive value in a cleartext file, so we will pass this variable to Terraform using an environment variable. Replace the below text inside the quotes with your copied Gretel API Key.

export TF_VAR_gretel_api_key=""

Deploy Gretel Hybrid on AWS

Run these terraform commands from the full_deployment directory.

Initialize Terraform. This is an idempotent operation and is always safe to run (resources will not be created or destroyed).

terraform init

View the changes Terraform will make upon deployment. Use this any time you make changes to take a closer look at what is going on.

terraform plan

Deploy the module. This will require user confirmation, so don't walk away from your shell until you confirm by typing "yes" to start the deployment.

terraform apply

It will take 10-20 minutes for all the necessary resources to be deployed. Congratulations! You've deployed everything necessary to run Gretel Hybrid within your own cloud tenant.
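If you have kubectl installed, you can optionally confirm the cluster is reachable before moving on. The commands below assume the EKS cluster name matches the deployment_name value from terraform.tfvars; adjust them if your deployment names the cluster differently.

# Add the new cluster to your local kubeconfig (cluster name assumed to match deployment_name)
aws eks update-kubeconfig --region us-west-2 --name gretel-hybrid-env

# List nodes; only the default node group will have nodes until Gretel jobs are scheduled
kubectl get nodes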

3. Generate Synthetic Data Within Your Cloud Environment ☁️

We’re almost ready to start synthesizing data.

Configure the Gretel CLI

Since we will be running a test job with the Gretel CLI, we must first configure it appropriately. Run the below command.

gretel configure

The configuration process will take you through step-by-step prompts to configure your Gretel CLI. You may hit “enter” to use the default value shown in square brackets in each prompt.

  1. For Endpoint, hit enter to keep the default value.
  2. For Artifact Endpoint, pass in the S3 URI of the gretel_sink_bucket_name from terraform.tfvars. The value will look like this: s3://gretel_sink_bucket_name. Make sure you update the value to point at your customized sink bucket name.
  3. For Default Runner, enter hybrid.
  4. For Gretel API Key, pass in the Gretel API Key you copied earlier.
  5. For Default Project, hit enter to keep the default value.

The values should look like this.

Endpoint [https://api.gretel.cloud]:
Artifact Endpoint [cloud]: s3://gretel_sink_bucket_name
Default Runner (cloud, local, hybrid) [cloud]: hybrid
Gretel API Key [grtuf6c5****]: grtu**************
Default Project []:

Upload Sample Data to Your Source Bucket

You’ll need to upload the sample data to the source AWS S3 bucket using the AWS CLI. You can do so with the following commands. Make sure you change the value being set for the SOURCE_BUCKET variable to match the “gretel_source_bucket_name” you specified in your terraform.tfvars file. The sample healthcare CSV is located at this link.

# Update this to match gretel_source_bucket_name in terraform.tfvars
export SOURCE_BUCKET=unique-gretel-source-bucket-name

# Download the file
url=https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/sample-synthetic-healthcare.csv
curl -o sample-synthetic-healthcare.csv $url

# Upload the data to the source bucket
aws s3 cp sample-synthetic-healthcare.csv s3://$SOURCE_BUCKET
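You can optionally confirm the upload succeeded by listing the object:

# Verify the demo dataset is present in the source bucket
aws s3 ls s3://$SOURCE_BUCKET/sample-synthetic-healthcare.csv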

Create a Gretel Project

Gretel Projects can be thought of as repositories that hold models. Projects are created by a single user and can be shared with others at various permission levels. Run the below command to create a project.

export GRETEL_PROJECT="Gretel-Hybrid-Deployment-Test-$RANDOM"
gretel projects create --name "$GRETEL_PROJECT" --display-name "Gretel Hybrid Test"

Generate Synthetic Data

We will take advantage of the GPU configuration within your Gretel Hybrid deployment and run our Gretel ACTGAN model against the demo data. Run the below command to schedule your Gretel Job. This can take anywhere from 5-15 minutes to complete.

gretel models create --config synthetics/tabular-actgan \
	--in-data s3://$SOURCE_BUCKET/sample-synthetic-healthcare.csv \
	--runner manual \
	--project $GRETEL_PROJECT
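If you have kubectl configured for the cluster, you can optionally watch the hybrid job pods while you wait. The namespace below is an assumption based on typical chart defaults; substitute your own if you customized it during deployment.

# Watch Gretel Hybrid job pods as they are scheduled and run (namespace is an assumption)
kubectl get pods -n gretel-hybrid --watch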

Review Sample Data 

Now that your Gretel Hybrid job has finished running, we can take a closer look at the sample synthetic data that was generated as part of the model training job. Make sure you set the SINK_BUCKET variable appropriately with the value from your terraform.tfvars file.

# Update this to match gretel_sink_bucket_name in terraform.tfvars
export SINK_BUCKET=unique-gretel-sink-bucket-name

# Get the model and project IDs with the CLI
export GRETEL_MODEL_ID=$(gretel models search --project $GRETEL_PROJECT | jq -r '.[].uid')
export GRETEL_PROJECT_ID=$(gretel models search --project $GRETEL_PROJECT | jq -r '.[].project_id')

Now we can list the output artifacts from our model training job.

# List the artifacts generated as part of a Gretel model training job
aws s3 ls s3://$SINK_BUCKET/$GRETEL_PROJECT_ID/model/$GRETEL_MODEL_ID/

We will copy the sample data as well as the synthetic data quality report over to our local environment.

# Copy and unzip the sample data
aws s3 cp s3://$SINK_BUCKET/$GRETEL_PROJECT_ID/model/$GRETEL_MODEL_ID/data_preview.gz ./data_preview.csv.gz
gunzip data_preview.csv.gz

# Copy and unzip the quality report
aws s3 cp s3://$SINK_BUCKET/$GRETEL_PROJECT_ID/model/$GRETEL_MODEL_ID/report.html.gz ./
gunzip report.html.gz

You can open the quality report with your browser of choice to take a closer look at the insights Gretel provides when you train a synthetic model. You can compare data_preview.csv, which contains a small set of synthetic data, against the source demo dataset in sample-synthetic-healthcare.csv.
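For a quick sanity check from the terminal, you can also peek at the first few rows of each file side by side:

# Compare the first rows of the source data and the synthetic sample
head -n 5 sample-synthetic-healthcare.csv
head -n 5 data_preview.csv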

(Optional) Generate More Synthetic Data

Want to generate more synthetic data? You can run the following commands, choosing the number of synthetic records you’d like to generate from the previously trained model. This will result in another Gretel Job being scheduled to run within your Kubernetes cluster and may take ~5 minutes.

# Set the number of records to generate
export NUM_RECORDS=500
gretel records generate --project $GRETEL_PROJECT --model-id $GRETEL_MODEL_ID --num-records $NUM_RECORDS

When the job finishes running, we can take a closer look at the output artifacts using the below commands. First, let’s list the run IDs within the sink bucket.

# Find the ID for the record generation job
aws s3 ls s3://$SINK_BUCKET/$GRETEL_PROJECT_ID/run/
                       	PRE 6538035a18026d0f65c5fb91/

Now we will pass that ID into a variable as shown below, and copy the synthetic data the same way we did in the previous section.

# Set this based on your output from the previous "aws s3 ls" command
GRETEL_RECORD_GENERATION_ID=6538035a18026d0f65c5fb91

# Take a closer look at the available output artifacts from the generation job
aws s3 ls s3://$SINK_BUCKET/$GRETEL_PROJECT_ID/run/$GRETEL_RECORD_GENERATION_ID/

# Copy the synthetic data to your local environment
aws s3 cp s3://$SINK_BUCKET/$GRETEL_PROJECT_ID/run/$GRETEL_RECORD_GENERATION_ID/data.gz ./data.csv.gz

# Unzip the compressed data
gunzip data.csv.gz

Now you can compare the data.csv dataset against the source demo dataset contained in sample-synthetic-healthcare.csv. 

Cleanup 🧹

If you would like to clean up the AWS resources you provisioned while following along, the following commands will delete everything that was created. Run them from the full_deployment directory. The destroy command will ask for confirmation before proceeding.

# We must delete all objects in the S3 buckets before the buckets themselves may be deleted
aws s3 rm s3://$SOURCE_BUCKET --recursive
aws s3 rm s3://$SINK_BUCKET --recursive

# Destroy all AWS resources created by this Terraform module
terraform destroy

Wrapping Up

In this walkthrough, we created all of the AWS resources necessary to deploy Gretel Hybrid and generate synthetic data, while keeping data within your own cloud tenant. We demonstrated the process using a sample healthcare dataset for model training. You can check out our Gretel Hybrid documentation for further reading, or watch this video tutorial!

Now that you’ve seen Gretel Hybrid in action, the next step is to try it with your own data. You can be confident that your data remains secure in your environment and is never passed to Gretel Cloud at any point. You can export a dataset as a CSV and follow along with the above guide, replacing the sample dataset with your own.
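For example, uploading and training on your own dataset looks just like the demo commands above (my_sensitive_data.csv is a hypothetical file name; substitute your own):

# Upload your dataset to the Gretel Hybrid source bucket
aws s3 cp my_sensitive_data.csv s3://$SOURCE_BUCKET

# Train a model against it, just as we did with the demo data
gretel models create --config synthetics/tabular-actgan \
	--in-data s3://$SOURCE_BUCKET/my_sensitive_data.csv \
	--runner manual \
	--project $GRETEL_PROJECT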

In addition to regular tabular data, Gretel also supports relational data sources. Check out the Gretel Relational docs; Gretel Relational is fully compatible with Gretel Hybrid.

Thanks for following along and happy synthesizing!