Redact PII from data with the Gretel CLI
In this tutorial, Alex walks you through how to redact PII (personally identifiable information) from a dataset using the Gretel CLI.
Hey, I'm Alex. I'm a co-founder at Gretel. And today we are going to use the Gretel CLI to walk through identifying and transforming or redacting sensitive data. Let's go ahead and jump on in.
First thing I want to do is to go ahead and log into the Gretel console. From here, you can generate an API key that we can use to connect the CLI that's running on our endpoint or our remote machine here to the Gretel service. Go ahead and create that.
Next thing you want to do from your terminal window, if you haven't done that already, is go ahead and install the Gretel client: pip install gretel-client. Next thing I want to do is run gretel configure. Very similar to the AWS CLI, we can go ahead and accept most of the defaults here. We're going to go ahead and build in the cloud. You have the ability to deploy as a container and run locally, if you wish.
We'll go ahead and grab our API key here one more time. Paste it in. Select default project. Looking good.
From here we'll go ahead and follow the documentation on the CLI tutorial for redacting sensitive PII. Okay. And I'll start to paste things in. We're going to create a configuration file as our first step that tells our APIs how we wish to train the model.
We're going to name this redact_pii.yaml and go ahead and take a look through this. So we are specifying a model creation, a transform type. What that means is that we're going to be mutating the record. So not only labeling data, but transforming the labeled values that we found, and giving it a quick descriptive name called remove PII. What we're doing here is defining a set of rules. So we're searching for person name, credit card number, and so on, including a custom regular expression we find down here at the bottom. Whenever any of these are found, we're going to transform that data. When possible, we use a fake data generator. So we'll replace a name with a name or an address with an address. If no fake data generator is available, we will simply redact that data with the X character. This is a big use case for developers that are training machine learning models on customer chat records or things like that, where you want to maintain the semantics of the data, but you don't want to use any actual customer-sensitive information.
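To make the rule logic just described concrete, here is a minimal Python sketch of the same idea: find a match, swap in a fake value when a generator exists, and fall back to X characters when it doesn't. It is an illustration only, not Gretel's implementation. The rule names, the regular expressions, and the fake_email and fake_phone helpers are all hypothetical.

```python
import random
import re


# Hypothetical fake-value generators, for illustration only.
def fake_email(rng):
    return f"user{rng.randint(1000, 9999)}@example.com"


def fake_phone(rng):
    return f"555-{rng.randint(100, 999)}-{rng.randint(1000, 9999)}"


# Each rule is (label, pattern, generator). A generator of None means no fake
# data generator is available, so matches are redacted instead.
# These patterns are simplified assumptions, not Gretel's detectors.
RULES = [
    ("email", re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), fake_email),
    ("phone", re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), fake_phone),
    ("user_id", re.compile(r"\buser_\d+\b"), None),  # custom regex rule
]


def transform(text, rng=None):
    """Replace each match with a fake value, or with X's if no generator exists."""
    rng = rng or random.Random(0)
    for _label, pattern, generator in RULES:
        if generator is not None:
            text = pattern.sub(lambda m, g=generator: g(rng), text)
        else:
            text = pattern.sub(lambda m: "X" * len(m.group()), text)
    return text


if __name__ == "__main__":
    print(transform("Contact jane@corp.io or 415-555-0100, id user_12345"))
```

The email and phone number come back as different but plausible values, while the user ID, which has no generator in this sketch, comes back as a run of X's of the same length.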
Let's go ahead and save this. Next thing it's going to ask us to do is to create a pii.csv. So we can see this. We'll just copy this into our new file. Here we see IDs, names, emails, phone numbers, and so on. So we're going to really measure our ability to detect and to transform this data. In the tutorial, the next step here is to use the Gretel CLI to create a new project. You can think of a project as a holding place for your models and your data that are stored in the Gretel cloud. Let's go ahead and copy this in. Then we're going to create one called redact PII. We're going to set this as our default project.
Okay. Since it's our default project, we no longer need to specify it. So we will start the first process here, which is to create a Gretel model by training on our input data. So I'll go ahead and copy this in, creating a model. We're using our redact_pii.yaml. It's going to examine our input file here and essentially create a transformation model you can use for subsequent transformations of similar data types. We're going to use the cloud. We're going to output the model ID to something called model_id.json, which makes it really simple for us to reference it later. Let's go ahead and kick that off.
So from here, it's uploading our data to the Gretel cloud. If you choose to run a local runner and you have Docker installed, it will spin up a local container and run against that. Searching for a cloud worker. It's beginning model training. It should happen pretty quickly with such a small data set. Okay, it's done. Let's take a look at what's been created in our environment. We have a model_id.json. Now that we've trained the model, we can essentially pass any amount of data that we want to the model and have it both label and transform the data.
So let's take a look at this. Now we're going to use the gretel records command, telling it to transform the data using the model ID in model_id.json, so we don't have to memorize the ID here. We're using pii.csv as input data and telling it to output locally.
Okay. The process is done, so let's go ahead and look at the data. It downloaded the results, our transformed data set, to something called data.gz. So we're going to run some commands here to make it easy to visualize the results. And here we can see our transformed data set looks significantly different than the original data. So we see fake IDs being put in here. We see fake email addresses, fake phone numbers, fake Visa numbers, fake social security numbers.
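If you would rather inspect the output from Python than from the shell, a rough sketch could look like the following. It assumes data.gz is a gzip-compressed CSV with the same columns and row order as pii.csv, which is an assumption rather than something confirmed in the video. The load_rows and changed_counts helpers are hypothetical names introduced for this example.

```python
import csv
import gzip
import os


def load_rows(path):
    """Read a CSV file into a list of dicts; .gz files are decompressed on the fly."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def changed_counts(original, transformed):
    """Count, per column, how many values differ between the two row lists."""
    counts = {}
    for before, after in zip(original, transformed):
        for column, value in before.items():
            if after.get(column) != value:
                counts[column] = counts.get(column, 0) + 1
    return counts


if __name__ == "__main__":
    # File names follow the video; the comparison only runs if both exist.
    if os.path.exists("pii.csv") and os.path.exists("data.gz"):
        print(changed_counts(load_rows("pii.csv"), load_rows("data.gz")))
```

A column that never shows up in the counts was left untouched, which is a quick way to spot fields the rules did not detect.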
We did not have a faker for the last column here, this user ID. So essentially the values in that column of pii.csv have been turned into X's, since we didn't have a fake data generator for that. If you wish to change this, one of the things you may want to experiment with is using a fake data generator for all or using redact for all. So you could simply change your redact_pii.yaml here, and instead of using faker, you could do redact for all.