Automate Detecting Sensitive Personally Identifiable Information (PII)
In this post, we walk through using Gretel to label and transform streaming data, and to apply a policy that redacts and replaces sensitive content in both structured and free text with safe values.
Why is continuous detection important?
When you work directly with customer feedback, application logs, or even public datasets, it is important to identify and protect sensitive information, both to maintain compliance with standards such as GDPR and CCPA and to earn customers' trust. Many companies discover and label personal data through time-consuming, error-prone manual reviews of their databases and datasets. In today's streaming data pipelines and architectures, the shape and compliance properties of data can change in seconds, whether because a developer changes what is logged or because users store sensitive data in fields that were not meant for it.
Gretel’s named entity recognition APIs make it simple to identify, label, and transform sensitive content continuously, before it is ever written to disk and becomes a compliance problem. Let’s get started!
What kinds of sensitive data can Gretel detect?
Gretel can currently detect over 40 entity types in both structured and unstructured data, including names, addresses, gender, birthdays, and credentials. We do this through a combination of regular expression-based detections, custom detectors for entities based on FastText and word embeddings, and support for bringing your own custom named entity recognition models from spacy.io and HuggingFace (coming soon).
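To make the regular expression-based part of that description concrete, here is a minimal sketch using only Python's standard library. It is not Gretel's implementation: the pattern names, the patterns themselves, and the `find_entities` helper are assumptions made up for illustration, and real detectors need to handle far more formats and edge cases.

```python
import re

# Hypothetical, simplified patterns for illustration only.
# These are NOT Gretel's detectors and will miss many real-world formats.
PATTERNS = {
    "email_address": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_phone_number": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def find_entities(text):
    """Return every pattern match as a dict, sorted by start offset."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append(
                {
                    "label": label,
                    "start": match.start(),
                    "end": match.end(),
                    "text": match.group(),
                }
            )
    return sorted(found, key=lambda entity: entity["start"])


sample = "Contact jane.doe@example.com or call 555-867-5309."
entities = find_entities(sample)
for entity in entities:
    print(entity)
```

Returning character offsets alongside each label is a deliberate choice in this sketch: it lets later steps highlight or replace the exact span without searching the text again.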
Let’s write some code!
Follow along below, or sign up for a free account and launch our Jupyter Notebook to get started! In this example, we walk through a Blueprint detailing how to label and anonymize free text using Gretel’s NLP and labeling APIs.
First, connect to Gretel’s API service.
Next, load a sample dataset containing PII into Pandas. Let’s use the Enron email dataset from HuggingFace’s excellent datasets library.
Create a temporary project in Gretel for labeling data.
Now, iterate and display the labeled text stream as HTML.
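As a rough illustration of the display step, the sketch below wraps labeled spans in `<mark>` tags using only the standard library. The `highlight` helper and the `(start, end, label)` span format are assumptions for this example and do not reflect the shape of the records returned by Gretel's API.

```python
import html


def highlight(text, spans):
    """Wrap each (start, end, label) span in a <mark> tag and escape the rest.

    Spans are assumed not to overlap.
    """
    parts = []
    cursor = 0
    for start, end, label in sorted(spans):
        # Escape the unlabeled text between the previous span and this one.
        parts.append(html.escape(text[cursor:start]))
        parts.append(
            '<mark title="{}">{}</mark>'.format(
                html.escape(label), html.escape(text[start:end])
            )
        )
        cursor = end
    # Escape whatever follows the last span.
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)


text = "Email <jane@example.com> today"
start = text.index("jane@example.com")
spans = [(start, start + len("jane@example.com"), "email_address")]
rendered = highlight(text, spans)
print(rendered)
```

In a notebook, the resulting string can be passed to `IPython.display.HTML` to render it inline. Escaping the surrounding text matters here because email bodies often contain angle brackets that would otherwise be interpreted as markup.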
Build a transformation pipeline
After labeling the dataset, we've identified emails that contain PII, such as names and email addresses. The final step in this blueprint is to build a transformation pipeline that will replace names and other identifying information with fake representations of the data.
We make a point to replace rather than redact sensitive information. This preservation ensures the dataset remains valuable for downstream use cases such as machine learning, where the structure and contents of the data are essential.
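The difference between redacting and replacing is easy to see in a small example. This is a sketch with a made-up record and a simplified email pattern, not output from Gretel: after redaction, a downstream step that expects an email address finds nothing, while after replacement the record still has the same structure.

```python
import re

# Simplified email pattern, an assumption for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

record = "Please forward the invoice to pat.smith@corp.example by Friday."

# Redaction removes the value and its format along with it.
redacted = EMAIL.sub("[REDACTED]", record)

# Replacement swaps in a safe value that keeps the same format.
replaced = EMAIL.sub("casey.lee@example.com", record)


def count_emails(text):
    """Count how many email-shaped strings a downstream parser would find."""
    return len(EMAIL.findall(text))


print(redacted, count_emails(redacted))
print(replaced, count_emails(replaced))
```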
To learn more about data transformation pipelines with Gretel, check our website or SDK documentation.
Configure the pipeline. FakeConstantConfig will replace any entities configured under labels with a fake version of the entity.
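One way to think about this kind of replacement is as a seeded, deterministic mapping from each real value to a fake one. The sketch below illustrates that idea with the standard library only. It assumes that the same input should always map to the same fake value, which is our reading of "constant" here rather than a description of how `FakeConstantConfig` is implemented; `fake_email`, `replace_emails`, and the seed value are hypothetical names for this example.

```python
import hashlib
import re

# Simplified email pattern, an assumption for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def fake_email(original, seed="demo-seed"):
    """Derive a stable fake address from the original value and a seed."""
    digest = hashlib.sha256((seed + original.lower()).encode("utf-8")).hexdigest()
    return "user_{}@example.com".format(digest[:8])


def replace_emails(text, seed="demo-seed"):
    """Replace every email-shaped string with its deterministic fake."""
    return EMAIL.sub(lambda match: fake_email(match.group(), seed), text)


before = "From alice@corp.example to bob@corp.example, cc alice@corp.example"
after = replace_emails(before)
print(after)
```

Because the fake value depends only on the seed and the original, repeated mentions of the same person stay linked across records, which keeps relationships in the data intact. Changing the seed produces a different but equally consistent mapping.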
Run the pipeline to replace any sensitive strings.
Inspect a transformed email from the dataset.
Thanks for taking the time to read this article. Hopefully, after following along, you’ve seen how it’s possible to label and anonymize sensitive streams of data in real time.
Sign up for a free Gretel account to access our premium SDKs and get started in seconds. The code above is also available in Jupyter notebook format: Work Safely with Free Text Using Gretel.
Next steps
Interested in running Gretel's labeling API as a container in your cloud or on-premises environment? We're in private beta and would love to hear from you! The fastest way to reach our team is to join our community Slack channel, email us at hi@gretel.ai, or file an issue on GitHub.