Classify PII with the Gretel CLI

Video description

Apr 20, 2021

In this tutorial, Alex walks you through the steps needed to classify PII (Personally Identifiable Information) with the Gretel CLI.

Read the blog post


Alex (00:05):

Hey, I'm Alex. I'm the co-founder at Gretel. And today we're going to walk through using Gretel's APIs to label sensitive information in your CSV or log files.

Alex (00:13):

Let's go ahead and jump on in. First thing you'll want to do is get logged into the Gretel console and make sure your Gretel CLI is set up. So we'll follow the instructions for installing the CLI right now. It's a simple pip command: pip install gretel-client. Should already be set up here. Okay. Now that the CLI is set up, you'll want to run gretel configure. This walks you through setting up the different API endpoints and things that you're going to hit. Go ahead and hit "defaults" for all of these. We'll be running in the cloud today, so we'll specify the key for accessing the cloud. Grab your API key. Looking good.

Alex (01:07):

Just running gretel by itself gives you an overview of the different CLI commands. Today, as we're going to be labeling data, we'll be focusing on gretel models and gretel records. To follow along here, it's easiest to go straight to the docs. So let's go ahead and click on the doc link. We're going to go to "Discover PII" and follow along with this tutorial.

Alex (01:28):

So from here, I'm going to start. The first thing I want to do is build a configuration. This tells the Gretel APIs what you want to search for inside your data. There are a ton of different info types that you can find with Gretel. And here, if you click on this link, you can see a list of the 40-plus supported info types that we have currently. This list is always expanding, and you have the ability to define your own info types with Gretel, which we'll do in the example today. So quite a list of different things that you can search for. Back here, we'll go ahead and grab this config; the docs ask us to save it as a classify config file. Let's go ahead and do that.
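As a rough sketch, a classify config of the kind described here looks something like the following YAML. The exact schema and keys may differ by Gretel version; the label names and the acme/user_id custom info type are taken from the transcript, and everything else is an assumption.

```yaml
# Sketch of a Gretel classify configuration (schema may vary by version)
schema_version: "1.0"
name: classify-config
models:
  - classify:
      data_source: "_"
      labels:
        - person_name
        - credit_card_number
        - phone_number
        - email_address
        - acme/user_id      # custom info type defined below

# Custom info type: "user_" followed by five digits
label_predictors:
  namespace: acme
  regex:
    user_id:
      patterns:
        - score: high
          regex: "user_[\\d]{5}"
```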

Alex (02:01):

Go ahead and cut and paste it in there, and take a quick look at what it's doing. We're asking it to classify, since it's going to be training a classification model. It's going to be searching for these different labels: person name, credit card number, phone number, et cetera. We're also defining our own with a regular expression. Here you see us defining a namespace called acme, and inside of it a user ID regular expression, searching for "user_" followed by five digits, to define what a user ID looks like.
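To see what that custom regular expression matches, here's a minimal standalone check, assuming the pattern is "user_" followed by exactly five digits as described above:

```python
import re

# Hypothetical pattern for the custom acme/user_id info type:
# "user_" followed by five digits.
USER_ID_RE = re.compile(r"user_\d{5}")

samples = ["user_12345", "user_999", "id=user_00042;", "admin"]
matches = [bool(USER_ID_RE.search(s)) for s in samples]
print(matches)  # [True, False, True, False]
```

Note that `search` finds the pattern anywhere in the string, which is the behavior you'd want when scanning free-text fields.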

Alex (02:36):

Go ahead and save that, and then quickly build a data set. Here we have a data set ready to go called pii.csv. Go ahead and copy that in. Here we see the columns for the CSV we're sending in: ID, name, email, et cetera. If you're just sending in plain text, for example chat logs or Twitter feeds, you can simply create a single column named "tweets," for example, and copy the plain text in there, and that will work just fine. You see a variety of different types of data here: names, emails, Visa card numbers, phone numbers, things like that. Go ahead and save it.
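If you want to build a small test file like this yourself, a minimal sketch looks like the following. The column names are assumed from the transcript (the tutorial's real pii.csv has more fields), and the rows are made-up placeholder data:

```python
import csv

# Placeholder rows; the tutorial's real pii.csv contains more columns
# and realistic-looking PII values.
rows = [
    {"id": "1", "name": "Ada Lovelace", "email": "ada@example.com"},
    {"id": "2", "name": "Alan Turing", "email": "alan@example.com"},
]

with open("pii.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "email"])
    writer.writeheader()
    writer.writerows(rows)
```

For the plain-text case mentioned above, the same approach works with a single "tweets" column, one chunk of text per row.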

Alex (03:20):

Now we're going to create a project. A project is really just a construct inside Gretel that allows you to store models and information. You have the option of running these projects in the cloud or locally; for this example, we're going to run in the cloud. Go ahead and copy in gretel projects create. Here, we're going to create a project named classify example and make it our default project, so we don't have to specify it anymore. Okay, it's created and it's now our default project, so we don't need to pass the project name or project ID.

Alex (03:50):

The next command here, go ahead and copy it in. It's going to train our model on the CSV file we sent in, and we're going to save the results to a JSON file that can be used for subsequent calls. Taking a look at the individual parameters we're using on the Gretel models API: we're telling it to create a model using our classify config, giving it input data, which is pii.csv, and telling it to run in the cloud. You could also use a local runner if you want it to run locally; in that case, it will download a container that runs locally inside your environment. We're going to store the output to a file called model.json, which can be used as the input for the next command. Let's go ahead and run that.

Alex (04:43):

It should execute pretty quickly; even on larger CSV files, it runs pretty quickly. What it's doing is learning to infer labels for entire columns. For example, if it detects a name in 90% of a column's values, it learns to infer that the column most likely contains names, even in rows where one isn't directly detected; the same goes for a username, an email address, or a phone number.
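The column-level inference idea can be sketched in a few lines. This is purely illustrative of the concept described above, not Gretel's actual implementation, and the 90% threshold is taken from the example in the transcript:

```python
# Illustrative sketch (not Gretel's actual algorithm): if an entity is
# detected in at least 90% of a column's values, infer that label for
# the whole column, covering rows where per-value detection missed.
def infer_column_label(detections, threshold=0.9):
    """detections: list of bools, one per row, True if entity detected."""
    rate = sum(detections) / len(detections)
    return rate >= threshold

name_column = [True] * 9 + [False]  # name detected in 9 of 10 rows
print(infer_column_label(name_column))  # True
```

With this kind of rule, a single missed detection in an otherwise clean column doesn't leave one value unlabeled.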

Alex (05:05):

Great, our model has been created, and it looks like it completed correctly; it took about eight seconds of processing. So now the next thing we're going to do is classify these records. We're going to call the gretel records classify command. Instead of passing in the model ID, we're just going to pass in this model.json that we created earlier, which contains all the information that this model needs to run. The input data, once again, is pii.csv, and it's run in the cloud. We're going to tell it to output its findings from the cloud directly to our local file system.

Alex (05:47):

The cloud worker is picking it up. It's just started, and now it's done classifying. Let's go ahead and grab these results. What I'm going to do is grab this one line right here.

Alex (06:17):

Go ahead and view the results here. We're looking at one record and all the entities that were found. What we see is JSON line-delimited output, one record per line. So here we're looking at a single line; I'm using a Python utility to make this a little easier to view. On index zero, which is the first line, we found the following entities, returned in an array. We can see the character offset within each field where these entities occurred. So in the first field, a field called visa, we found the label credit_card_number. It's highly confident that it found a credit card number there, and the classifiers do additional validation beyond matching on the regular expression to reach that kind of confidence.
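A small sketch of reading that line-delimited output follows. The record here is a hypothetical stand-in mimicking the shape described above (index, entity array, field, label, score, character offsets); the real field names may differ by API version:

```python
import json

# Hypothetical record mimicking the shape of the classify findings
# described above; real key names may differ by API version.
line = json.dumps({
    "index": 0,
    "entities": [
        {"field": "visa", "label": "credit_card_number",
         "score": 0.9, "start": 0, "end": 16},
        {"field": "user_id", "label": "acme/user_id",
         "score": 0.8, "start": 0, "end": 10},
    ],
})

# Each line of the output file is one JSON record; parse line by line.
record = json.loads(line)
for ent in record["entities"]:
    print(f'{ent["field"]}: {ent["label"]} [{ent["start"]}:{ent["end"]}]')
```

With a real results file, you'd loop over the file handle and call `json.loads` on each line the same way.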

Alex (07:00):

And we see the match starting at character offset zero and ending at character offset 16. Going all the way through the rest of the findings here, including the custom regular expression at the end, we found a user ID. It's highly confident there as well, starting at character offset zero and ending at 10.