Got text? Use Named Entity Recognition (NER) to label PII in your data

Use Gretel’s NLP setting to label PII including people names and geographic locations in free text.
Copyright (c) 2021 Gretel
Copyright (c) 2021 Gretel
In this post, we'll walk you through analyzing a Yelp reviews dataset and labeling people names and geographic locations

Is your unstructured data safe? 

80% of data is unstructured, and yet most companies are unprepared when it comes to labeling and securing it. Those that do have processes in place often use laborious and error-prone manual reviews to secure their data, which is neither reliable nor scalable.

Gretel’s data classification APIs make it simple to identify and label sensitive content in both structured and unstructured data. Gretel can currently detect over 40 entity types including names, addresses, gender, birthday, and credentials. We do this through a combination of regular expression-based detections, custom detectors for entities based on FastText, and word embeddings. And we now also support spaCy's industrial strength Natural Language Processing (NLP) for identifying person names and locations.

Let’s get started! 

Sign up for a free account using the Gretel Console, and then follow along below. Or use our Jupyter Notebook to get started! In this example, we'll walk you through labeling free text in a dataset of restaurant reviews using Gretel’s NLP and labeling APIs.

First, let's install all our python dependencies and configure the Gretel client.

!pip install -Uqq gretel-client spacy datasets

Next, we'll connect to Gretel’s API service. You'll need your API key for this step.

from getpass import getpass
import json
import datasets
import pandas as pd
import yaml
from smart_open import open
from gretel_client import create_project, poll, configure_session, ClientConfig

pd.set_option('max_colwidth', None)

dataset_file_path = "reviews.csv"

        api_key=getpass(prompt="Enter Gretel API key"),

We can now load a sample dataset containing PII into Pandas. Let’s use the Yelp Reviews dataset from HuggingFace’s datasets library.

source_dataset = datasets.load_dataset("yelp_review_full")
source_df = pd.DataFrame(source_dataset["train"]).sample(n=300, random_state=99)
source_df.to_csv(dataset_file_path, index=False)

Next, let's create a project in Gretel for labeling our data.

project = create_project(display_name="Gretel NLP Yelp Reviews")

We'll use a very simple configuration file. NLP can be enabled in a single line. Note: Using spaCy can decrease model prediction throughput by up to 70%, that's why it's turned off by default.

# Passing `use_nlp: true` into the model config,
# enables additional predictions using NLP models.
classify_config = """
schema_version: "1.0"
  - classify:
      data_source: "_"
      use_nlp: true

The next step is to create the classification model and verify it to make sure it has been configured correctly.

model = project.create_model_obj(yaml.safe_load(classify_config), dataset_file_path)

# `report_json` contains a summary of entities by field
with open(model.get_artifact_link("report_json")) as fh:
    report = json.load(fh)
# By converting these summaries into a dataframe we can quickly view
# entities found by the model.
summary = []
for field in report["metadata"]["fields"]:
    row = {"name": field["name"]}
    for entity in field["entities"]:
        row[entity["label"]] = entity["count"]


Once the model has been configured and verified, all we have to do is run the full dataset through it.

records = model.create_record_handler_obj(data_source=dataset_file_path)

# the `data` artifact returns a JSONL formatted file containing
# entity predictions by row.
with open(records.get_artifact_link("data")) as fh:
    records = [json.loads(line) for line in fh]

Here's a sample of the labeled reviews. If you're interested in how we color-coded the labels, check out our blueprint notebook in Github or run it in Google Colab. You can also use this blueprint to label your own datasets using NLP.

Labeled Yelp Reviews
Labeled Yelp Reviews

Did you know?

In addition to classifying sensitive data in unstructured text, you can also transform it. Gretel includes support for replacing names and addresses with fake versions, or securely hashing PII (personally identifiable information) to meet compliance requirements. Here's a 6-min video walkthrough on redacting PII from a dataset using the Gretel CLI, or get all the details from our docs. Don’t forget to set use_nlp: true in your config file.

As with all of Gretel’s APIs, you can get started quickly with SaaS, or deploy our NLP into your own cloud or VPC. Sign up to get your API key and see for yourself how easy it is.

What do you think?

Did you find this blog post interesting? Have any use cases you'd like us to write about? Email us, tweet us, or ping us on Slack. We'd love to hear from you!