How we accidentally discovered personal data in a popular Kaggle dataset
In the coming weeks, the Gretel Public Beta will be available for developers everywhere. We are excited, and we want to preview some of the features early so you can start thinking about the workflows they make possible!
In this post we will share some of the upcoming features in Gretel, and how those features enabled us to discover personally identifiable information including full names and email addresses in the popular Lending Club financial dataset on Kaggle.
Gretel helps developers share data more safely so you can build faster and better. To do that, we are building workflows that enable you to better understand your data and then make informed decisions on how to make this data safe to share. This workflow is outlined below:
When you use Gretel, your data’s first stop is our automatic data labeling pipeline. To label different types of entities, we combine Natural Language Processing (NLP), featuring neural network-based entity recognition for names and addresses, with managed regular expressions and custom extractors. In the NLP space, this is referred to as Named Entity Recognition (NER).
Gretel offers NER in two flavors:
- Asynchronous project streams: Here you create a Gretel Project and send your data through our REST API. Your data is labeled and written to an encrypted, memory-only stream. The labeled records can then be consumed back through our REST API, allowing a variety of publish/subscribe workflows to be built.
- Synchronous API: Our REST API also has endpoints where you can send a limited number of records and instantly receive a labeled version of the record back. This is great for experimentation, debugging, or just getting a feel for our APIs.
The data structure is the same whether you use our synchronous API or our publish/subscribe project stream. That allows you to build applications that function the same regardless of which API you use.
Now, we’ll walk you through automatic data labeling in Gretel, using real Lending Club loan data from Kaggle. (We were impressed by the vast dimensionality of the dataset, so we have been using it for some of our own testing!) The Lending Club loan dataset has approximately 2.2 million records with 147 fields in each record. We used one of our language client libraries to stream all the records through our API and explore the entity index that was built.
Take a look at the Lending Club example below (we altered the record before testing with it so as not to reveal any of the original data). All labeled record results have a “records” array. Each element in this array pairs the original record we received with the metadata we attached, under two top-level keys: “data” and “metadata.” The “data” key is the original record we received. The “metadata” key contains all of the information we learned about this record.
Note that we assign a unique “gretel_id” for every record you send us. In synchronous mode, the ID is returned along with all completed data. When you use the asynchronous API, you only receive the ID back and the labeled data is available in your project stream.
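To make that shape concrete, here is a minimal Python sketch of a labeled result, built from made-up values rather than real Lending Club data. The “records” array and the “data” and “metadata” keys follow the description above; placing “gretel_id” inside “metadata” is an assumption made for illustration, not a documented layout, and split_records is a hypothetical helper.

```python
# Hypothetical labeled result; every value here is invented for illustration.
labeled_result = {
    "records": [
        {
            # The original record, exactly as it was sent.
            "data": {"loan_amnt": 5000, "desc": "Contact me at jane@example.com"},
            "metadata": {
                # Assumption: the ID sits under "metadata"; the real location may differ.
                "gretel_id": "00000000-0000-0000-0000-000000000001",
            },
        }
    ]
}


def split_records(result):
    """Yield (gretel_id, original_record) pairs from a labeled result."""
    for element in result["records"]:
        yield element["metadata"].get("gretel_id"), element["data"]


pairs = list(split_records(labeled_result))
print(pairs[0][0], pairs[0][1]["loan_amnt"])
```

Because the structure is the same in both modes, a helper like this could sit in front of either the synchronous response or records read back from a project stream.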
We also provide metadata on a per-field basis. As you can see, in this record (and several others) the “desc” field contains a lot of free-form text, which can easily include sensitive information. Any field with detected entities has a “ner” key holding an array of label objects. Once this metadata is created, you can plug into our Transformer SDK, use our rule-engine Docker container to do specific routing based on detected entities, or build your own workflows.
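If you build your own workflow, the per-field metadata can drive simple routing decisions. The sketch below is one way that might look in Python. It assumes per-field metadata is keyed by field name under a “fields” key, which is a guess at the nesting rather than a documented layout, and the label names in SENSITIVE_LABELS are hypothetical stand-ins for entries in the label catalog.

```python
# Hypothetical label names; the real catalog may use different identifiers.
SENSITIVE_LABELS = {"email_address", "phone_number", "person_name"}

# Invented record for illustration only.
element = {
    "data": {"loan_amnt": 5000, "desc": "Contact me at jane@example.com"},
    "metadata": {
        # Assumption: per-field metadata is keyed by field name under "fields".
        "fields": {
            "loan_amnt": {},
            "desc": {
                "ner": [
                    {"text": "jane@example.com", "start": 14, "end": 30,
                     "label": "email_address", "source": "regex", "score": 0.8}
                ]
            },
        }
    },
}


def fields_with_entities(element):
    """Return {field_name: [label objects]} for fields that carry a "ner" key."""
    fields = element["metadata"].get("fields", {})
    return {name: meta["ner"] for name, meta in fields.items() if "ner" in meta}


def route(element):
    """Return "quarantine" if any detected label is sensitive, else "pass"."""
    for labels in fields_with_entities(element).values():
        if any(label["label"] in SENSITIVE_LABELS for label in labels):
            return "quarantine"
    return "pass"


print(list(fields_with_entities(element)), route(element))
```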
Looking closer at the label object structure:
- text: The actual data that was matched.
- start: The starting index offset in the string that contains the entity.
- end: The ending index offset in the string that contains the entity.
- label: A unique label, managed by Gretel, that describes this entity. We maintain a catalog of these labels and each entity will have one.
- source: A unique name for the backend service that discovered the entity. If the same entity is matched by multiple sources (for example, both spaCy and a regular expression), there will be a separate label object for each.
- score: Depending on the detector, we assign a score between 0 and 1 to indicate a confidence level. Because we are optimized for working with semi-structured data, we use field names to help infer detections, which leads to higher scores. Lower scores indicate detections that had no supporting context from the field name.
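Since each source emits its own label object, a consumer may want to collapse duplicates and ignore low-confidence matches. The following Python sketch models the six keys above as a dataclass and keeps the highest-scoring label per entity span. The Label class, the best_per_entity helper, the detector names, and the sample values are all hypothetical; this is one possible way to consume the structure, not part of Gretel’s SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    text: str     # the matched data
    start: int    # starting offset within the field value
    end: int      # ending offset within the field value
    label: str    # catalog label describing the entity
    source: str   # backend service that found the entity
    score: float  # confidence between 0 and 1


def best_per_entity(labels, min_score=0.0):
    """Keep the highest-scoring label for each (start, end, label) span."""
    best = {}
    for item in labels:
        if item.score < min_score:
            continue  # skip detections below the confidence threshold
        key = (item.start, item.end, item.label)
        if key not in best or item.score > best[key].score:
            best[key] = item
    return sorted(best.values(), key=lambda item: item.start)


# Invented labels: the same email found by two hypothetical detectors.
labels = [
    Label("jane@example.com", 14, 30, "email_address", "detector_a", 0.8),
    Label("jane@example.com", 14, 30, "email_address", "detector_b", 0.6),
    Label("555-0100", 40, 48, "phone_number", "detector_a", 0.3),
]

for item in best_per_entity(labels, min_score=0.5):
    print(item.label, item.source, item.score)
```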
If you followed our example closely, you may have realized that, as a result of our data labeling, we discovered personally identifiable information in this public dataset. To our surprise, we noticed a handful of specific identifiers, like email addresses and phone numbers, when labeling the data. At first, we assumed they had been redacted and substituted with fake surrogate data. Ultimately, the Gretel data labeling service discovered 27 records containing email addresses, first and last names, and phone numbers.
We take privacy very seriously, so we immediately notified Lending Club of our discovery and have been working with their security team to remediate personal information in their public datasets. Had this data been scanned by Gretel originally, these records could have been properly sanitized, specific fields (like the verbose “desc” field) dropped, or entire records dropped before the dataset was released publicly.
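As a rough illustration of those three options, the Python sketch below redacts labeled spans, drops named fields, or drops a whole record when it carries too many entities. It is a sketch under stated assumptions, not Gretel’s Transformer SDK: redact_field and sanitize are hypothetical helpers, per-field metadata is assumed to live under a “fields” key, “end” is assumed to be an exclusive offset like a Python slice, and labeled spans are assumed not to overlap.

```python
REDACTED = "[REDACTED]"


def redact_field(value, labels):
    """Replace each labeled span, working right to left so offsets stay valid."""
    # Assumption: "end" is exclusive and spans do not overlap.
    for label in sorted(labels, key=lambda item: item["start"], reverse=True):
        value = value[: label["start"]] + REDACTED + value[label["end"]:]
    return value


def sanitize(element, drop_fields=(), max_entities=None):
    """Return a cleaned copy of the record, or None if the record is dropped."""
    # Assumption: per-field metadata is keyed by field name under "fields".
    fields = element["metadata"].get("fields", {})
    total = sum(len(meta.get("ner", [])) for meta in fields.values())
    if max_entities is not None and total > max_entities:
        return None  # too many entities: drop the entire record
    clean = {}
    for name, value in element["data"].items():
        if name in drop_fields:
            continue  # drop the field outright
        labels = fields.get(name, {}).get("ner", [])
        if labels and isinstance(value, str):
            value = redact_field(value, labels)
        clean[name] = value
    return clean


# Invented record for illustration only.
element = {
    "data": {"loan_amnt": 5000, "desc": "Contact me at jane@example.com"},
    "metadata": {"fields": {"desc": {"ner": [
        {"text": "jane@example.com", "start": 14, "end": 30,
         "label": "email_address", "source": "regex", "score": 0.8}
    ]}}},
}

print(sanitize(element))
print(sanitize(element, drop_fields=("desc",)))
print(sanitize(element, max_entities=0))
```

Which option fits depends on how much of the free-form text you need to keep: redaction preserves the surrounding context, while dropping the field or record is the more conservative choice.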