Bring Your Own Cloud (BYOC): Transforming & Synthesizing Data with Gretel Hybrid

We anonymized a massive 7 terabytes of data in a hybrid cloud environment with Gretel Transform. Here's how we did it.

Background

The challenge

Gretel has a snappy SaaS offering that runs in the cloud, but some of our customers (particularly those working in heavily regulated industries like banking and insurance) have stringent data residency requirements. They can’t simply move their data into our managed cloud offering. The data and models have to remain within their cloud throughout the process. This is a problem that we have solved in a couple of different ways. 

Our insight 💡

Our initial offering is a single-machine setup that uses Docker as the workhorse for the on-premise component of model training and data generation. We call this setup the “Gretel Agent.”

While this works well for smaller workloads, larger datasets can take a very long time to process on a single machine (think months to years). It also introduces a single point of failure. (Hopefully Tom doesn’t kick the power cable on the Gretel Agent!) For those reasons, it’s being phased out in favor of our new approach.

We settled on a Kubernetes deployment both to stay cloud agnostic and to scale our hybrid workflows with the number of nodes we can run in a customer’s environment. We call this our “Gretel Kubernetes Agent.” It’s currently in closed beta, so reach out to us for more information on getting started.

Both the Docker and Kubernetes-based options represent the on-premise piece of what we call “Gretel Hybrid.” 

The main idea is that customers can use Gretel’s code and our scheduling APIs, but only metadata goes back and forth between their cloud and Gretel’s. The data, the trained models, and the synthetic data all remain within the customer’s cloud during the process.

Setting the stage  

For this use case, the customer had a 7TB cloud database, and they wanted to use that data in a lower environment. A lower environment is something like a performance testing environment, where changes can be verified before they go out to production. Using production-like data there lets testing reflect both the shape and the scale of the real data, something that is hard to achieve by simply inserting a bunch of generated or duplicate rows.

The main concern is exposing PII (Personally Identifiable Information, e.g., names, phone numbers, and emails) to developers or other folks who have access to the lower environment.

In addition to allowing you to generate an entirely synthetic dataset, Gretel also offers a Transform package which makes it easy to remove or alter sensitive fields while keeping the rest of the data the same. This is useful for lower environments, where data will not be shared externally. We used that as part of this hybrid data pipeline.
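
To make that concrete, here’s a minimal sketch of what a Transform configuration can look like when expressed as a Python dict for the gretel-client SDK. The policy, rule, and entity labels below are illustrative placeholders, not the exact configuration we used for this customer.

```python
# Illustrative Gretel Transform config (v1 "transforms" model type), expressed as a dict.
# The entity labels and rule names are examples only, not the customer's actual setup.
transform_config = {
    "schema_version": "1.0",
    "name": "fake-pii-for-lower-env",
    "models": [
        {
            "transforms": {
                "data_source": "__tmp__",  # filled in per job when the model is created
                "policies": [
                    {
                        "name": "remove_pii",
                        "rules": [
                            {
                                # Swap detected PII entities for realistic fake values
                                "name": "fake_identifiers",
                                "conditions": {
                                    "value_label": [
                                        "person_name",
                                        "email_address",
                                        "phone_number",
                                    ]
                                },
                                "transforms": [{"type": "fake"}],
                            }
                        ],
                    }
                ],
            }
        }
    ],
}
```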

So how did we transform 7 TB of data? Let’s get into it.

The Kubernetes deployment

Diagram demonstrating Gretel's hybrid deployment workflow with a customer’s database.

After a few iterations with the customer, we got our Helm chart installed just the way they wanted it. For those who haven’t worked with Kubernetes much, a Helm chart is one of the standard mechanisms for packaging and distributing software to Kubernetes clusters while allowing customer-specific customizations. We started with a very simple chart for folks who wanted to run it in a cluster set up specifically for model training and data generation, but we added more hooks to the chart so the customer could install it in a multi-purpose Kubernetes cluster. The main lesson learned was that sensible defaults, paired with options tailored to the customer’s specific needs, make for the best installation strategy.

These hooks meant adding options to set the appropriate node selectors (i.e., a way to choose which nodes to use, by label), taints/tolerations (i.e., a way to designate which nodes are usable by which jobs), and resource requests and limits for the Gretel Kubernetes Agent and the workers it spawns. The end product was simply a `helm install` that a customer could run on their shared cluster.
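
To give a feel for what those hooks look like in practice, here’s a hypothetical values override a customer might pass to `helm install` on a shared cluster. The key names are stand-ins rather than the chart’s actual values; the point is simply where node selectors, tolerations, and resource requests come into play.

```python
# Hypothetical Helm values override for a shared cluster.
# Key names are illustrative stand-ins, not the chart's real value names.
import yaml

values_override = {
    "gretelAgent": {
        # Pin the agent (and the workers it spawns) to a dedicated node group
        "nodeSelector": {"workload": "gretel"},
        # Tolerate the taint used to reserve those nodes for Gretel jobs
        "tolerations": [
            {
                "key": "dedicated",
                "operator": "Equal",
                "value": "gretel",
                "effect": "NoSchedule",
            }
        ],
    },
    "worker": {
        # Right-size the per-job pods for the cluster's node types
        "resources": {
            "requests": {"cpu": "2", "memory": "8Gi"},
            "limits": {"cpu": "4", "memory": "16Gi"},
        }
    },
}

with open("values-override.yaml", "w") as f:
    yaml.safe_dump(values_override, f, sort_keys=False)

# Then install with: helm install gretel-agent <chart> -f values-override.yaml
```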

The above diagram shows how this Helm chart allows us to control scheduling from within Gretel Cloud, while keeping all the data within a customer’s cloud.

  • The customer prepares the data by putting it into a source bucket that the Kubernetes workers can access (we’ll get to options for how to load that data in the next section).
  • Once the data is present, we use Gretel’s APIs to orchestrate training and data generation jobs. We pull our proprietary container images into the customer’s cloud environment to run the workloads, which also lets us push out enhancements and fixes. The trained models and the output data are all placed in the customer’s sink bucket, keeping the data right where they want it.

The Gretel Kubernetes Agent is a process that polls for and processes jobs scheduled via our APIs. It can schedule as many jobs in parallel as the customer’s plan and environment allow.

But before we can start running these workloads, we have to prepare the source buckets.

Preparing the data

Along with Kubernetes, the common denominator for our deployment strategy is object storage (S3, Azure Blob Storage, GCS, etc.).

The Gretel training and transform jobs need access to a bucket as part of the deployment, and the Helm chart has hooks for passing in credentials or a Kubernetes Service Account so the jobs can access the source and sink buckets. One of the keys to getting this working is testing the policy setup both from within the cluster (by running a sample job) and with tools like AWS’s IAM policy simulator, to confirm the service account will be able to access every bucket it needs.
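
For an AWS-flavored example, a quick smoke test along these lines, run as a one-off pod in the cluster (or locally with the worker’s role assumed), can confirm the buckets are reachable before scheduling real jobs. The bucket names and role ARN are placeholders.

```python
# Smoke test: can the job's identity see the source bucket and write to the sink bucket?
# Bucket names and the role ARN are placeholders.
import boto3

SOURCE_BUCKET = "customer-gretel-source"
SINK_BUCKET = "customer-gretel-sink"

s3 = boto3.client("s3")

# Confirm both buckets are visible to this identity
s3.head_bucket(Bucket=SOURCE_BUCKET)
s3.head_bucket(Bucket=SINK_BUCKET)

# Confirm we can list source objects and write to the sink
s3.list_objects_v2(Bucket=SOURCE_BUCKET, MaxKeys=1)
s3.put_object(Bucket=SINK_BUCKET, Key="gretel-smoke-test.txt", Body=b"ok")

# Optionally, double-check the IAM policy itself with the policy simulator
iam = boto3.client("iam")
result = iam.simulate_principal_policy(
    PolicySourceArn="arn:aws:iam::123456789012:role/gretel-worker",  # placeholder ARN
    ActionNames=["s3:GetObject", "s3:PutObject"],
    ResourceArns=[f"arn:aws:s3:::{SINK_BUCKET}/*"],
)
for evaluation in result["EvaluationResults"]:
    print(evaluation["EvalActionName"], evaluation["EvalDecision"])
```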

Gretel Relational allows us to take data directly out of a source database (RDBMS), but in this instance, we used a utility that exported the dataset into Parquet files. Gretel supports CSV, JSONL, and Parquet, along with a built-in S3 connector that makes it easy to set up data sources when direct access to the database isn’t available.
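
If you do have direct database access, chunked reads with pandas are one simple way to land a table as Parquet files in the source bucket. This is a sketch only; the connection string, table, and bucket path are placeholders, and writing straight to S3 assumes s3fs and pyarrow are installed.

```python
# Export a table to Parquet, one file per chunk, into the source bucket.
# Connection string, table name, and bucket path are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@prod-db:5432/appdb")  # placeholder

chunks = pd.read_sql_query(
    "SELECT * FROM customers",  # placeholder table
    engine,
    chunksize=500_000,  # keep each Parquet file a manageable size
)

for i, chunk in enumerate(chunks):
    # Writing directly to s3:// paths requires s3fs + pyarrow
    chunk.to_parquet(
        f"s3://customer-gretel-source/customers/part-{i:05d}.parquet",
        index=False,
    )
```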

Once we had the data in our source bucket, it was simply a matter of setting up some scripts to schedule the jobs that would run in the Gretel Kubernetes Agent deployment.

Running the training and transforms

While scaling up the workloads, we fixed some issues that limited how quickly the Gretel Kubernetes Agent could place jobs, and we worked with the customer to change the cluster’s CIDR block range, since we were running out of Pod IPs from running so many jobs. A CIDR (Classless Inter-Domain Routing) block defines the range of IP addresses available for use; here, it determined how many Pod IPs the cluster could hand out.

Once we got over those scaling hurdles, we were able to run 100 workloads in parallel for this customer, getting through thousands of files every hour. The scheduling was all done via Python scripts that used our gretel-client SDK to schedule the jobs that the on-premise deployment would poll for.
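
A simplified sketch of that scheduling loop is below. Project and bucket names are placeholders, `transform/default` refers to a published Transform blueprint, and the right submission call for a hybrid deployment depends on your gretel-client version and configuration, so treat this as an outline rather than the scripts we actually ran.

```python
# Simplified scheduling loop (placeholders throughout; not the actual production scripts).
from gretel_client import configure_session
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project

configure_session(api_key="prompt", validate=True)
project = create_or_get_unique_project(name="hybrid-transform-demo")  # placeholder

# One job per Parquet file; in practice this list was built from the source bucket
data_sources = [
    "s3://customer-gretel-source/customers/part-00000.parquet",
    "s3://customer-gretel-source/customers/part-00001.parquet",
]

for source in data_sources:
    model = project.create_model_obj(
        model_config="transform/default",  # or an inline config dict like the one above
        data_source=source,
    )
    # Submit the job via Gretel's API; in this setup the Gretel Kubernetes Agent
    # polls for it and runs it in-cluster. The exact submit call / runner mode for
    # hybrid depends on your gretel-client version -- check the Gretel Hybrid docs.
    model.submit_cloud()
    # The real scripts submitted many jobs up front and polled them as a batch
    poll(model)
```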

Because everyone’s data is different, we had to tweak the Transform config for certain tables, but for the most part we could use the same configuration to transform PII into fake data across many tables. We also made some improvements to our Transform APIs to help future customers, such as better handling of entity recognition runs that can take a very long time, and a few additional ways to fake different data types.

Once we had all of the rough edges ironed out, we were ready to load the data into a lower environment.

Giving the transformed data a new home

The end goal of this journey was to get production-like data into a test database. We cleared the last hurdle by using the gretel-client SDK once again, this time to locate all the transformed files for each table (we mapped tables to models 1-to-1) and load them into the performance database.

Once we had the necessary access to the sink bucket and the test RDBMS instance, it was straightforward to load the Parquet files with pandas and SQLAlchemy. The 7TB of transformed data had found a new home.
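
As a condensed sketch of that load step, assuming S3 for the sink bucket; the bucket, prefix, target table, and connection string are placeholders.

```python
# Load transformed Parquet files from the sink bucket into the test database.
# Bucket, prefix, table, and connection string are placeholders.
import boto3
import pandas as pd
from sqlalchemy import create_engine

SINK_BUCKET = "customer-gretel-sink"
PREFIX = "transformed/customers/"
engine = create_engine("postgresql://user:pass@perf-db:5432/appdb")  # placeholder

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=SINK_BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        if not obj["Key"].endswith(".parquet"):
            continue
        # Reading s3:// paths directly requires s3fs + pyarrow
        df = pd.read_parquet(f"s3://{SINK_BUCKET}/{obj['Key']}")
        # Append each file's rows into the target table
        df.to_sql("customers", engine, if_exists="append", index=False)
```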

What have we learned?

Gretel lets you transform data in your own cloud using Kubernetes and object storage as the common ground.

We’re always looking to improve this setup, and we hope to work with more customers who want to generate and transform data in hybrid environments, so we can keep building the best tools for innovating safely with data while respecting sensitive information and complying with privacy regulations.

If you’re interested in what it takes to set up hybrid environments, check out our documentation on setting up Gretel Hybrid.