Detect and redact PII in free text with NER in Transform v2
Whether sharing data internally, training large language models (LLMs), or building vector stores for retrieval-augmented generation (RAG), organizations must constantly balance the need to protect personally identifiable information (PII) against the demand to extract valuable insights.
With unstructured data such as call center transcripts or clinical and doctors' notes, adequately protecting PII requires both sophisticated models that can label entities based on their values and the surrounding context, and flexible transformations that can remove PII entities while preserving as much of the text's utility as possible for downstream applications.
Transform v2 is a powerful tool that simplifies the process of detecting and redacting PII in tabular and free text formats using Named Entity Recognition (NER). With this update, you can now:
- Label custom PII entity types by simply listing them in the config, without the need for additional training.
- Use four NER functions to remove PII entities while optimizing the de-identified text for your specific use case.
Meet the four NER functions
Below is a brief description of the new NER functions you can use. For a deeper dive, check out the Transform v2 docs.
redact_entities
hash_entities
fake_entities
label_entities
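To make the difference between the four functions concrete, here is a minimal Python sketch of the four replacement strategies applied to spans that have already been labeled. It is not Transform v2 code: the helper names (`find_span`, `apply_strategy`), the output formats, and the truncated SHA-256 hash are assumptions inferred from the function names, and the entity spans are supplied by hand where a real NER model would produce them.

```python
import hashlib


def find_span(text, value, label):
    """Locate `value` in `text` and return a (start, end, label) span."""
    start = text.index(value)
    return (start, start + len(value), label)


def apply_strategy(text, spans, strategy, fake_values=None):
    """Rewrite each labeled span using one of four illustrative strategies.

    The output formats are assumptions made for illustration only.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])
        value = text[start:end]
        if strategy == "redact":
            # Replace the value with its entity type.
            pieces.append(f"<{label}>")
        elif strategy == "hash":
            # Replace the value with a short, deterministic digest.
            digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
            pieces.append(digest[:10])
        elif strategy == "fake":
            # Replace the value with a caller-supplied stand-in.
            pieces.append(fake_values[label])
        elif strategy == "label":
            # Keep the value but annotate it with its entity type.
            pieces.append(f'<entity type="{label}" value="{value}">')
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "Call John Smith at 555-0100."
spans = [
    find_span(text, "John Smith", "name"),
    find_span(text, "555-0100", "phone_number"),
]
fakes = {"name": "Jane Doe", "phone_number": "555-0199"}

for strategy in ("redact", "hash", "fake", "label"):
    print(strategy, "->", apply_strategy(text, spans, strategy, fakes))
```

Redaction and hashing remove the original value entirely, faking keeps the text readable for downstream models, and labeling keeps the value while marking where the entities are, which is useful for reviewing what the model detected.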
Use our default entity list or create your own
As with other Transform v2 features, the new NER functions are highly versatile. Unlike legacy systems, Transform v2 NER does not confine you to a predefined set of entities. You can list any entities you need to detect in the Transform v2 config, and the model will interpret their semantic meaning and attempt to detect them without prior training, though detection quality may vary by entity type. For example, a healthcare organization could use the Transform v2 config below to detect and replace non-default entity types such as medical record numbers and blood types:
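As a rough illustration of such a config, the sketch below builds a config-shaped Python dict and prints it as JSON. Only the four function names come from this post. The key layout (`models`, `transform_v2`, `globals`, `classify`, `entities`, `steps`, `rows`, `update`), the `this | <function>` expression, the entity spellings, and the `notes` column name are assumptions, so check the Transform v2 docs for the actual schema before using it.

```python
import json

# The four NER function names mentioned in this post.
NER_FUNCTIONS = {
    "redact_entities",
    "hash_entities",
    "fake_entities",
    "label_entities",
}


def build_ner_config(extra_entities, column, ner_function="fake_entities"):
    """Return a config-shaped dict for one free-text column.

    Every key name below is an assumption, not a confirmed schema.
    """
    if ner_function not in NER_FUNCTIONS:
        raise ValueError(f"unknown NER function: {ner_function}")
    return {
        "models": [
            {
                "transform_v2": {
                    # Assumed location for the custom entity list.
                    "globals": {
                        "classify": {"entities": list(extra_entities)},
                    },
                    # Assumed step layout: rewrite one column in every row.
                    "steps": [
                        {
                            "rows": {
                                "update": [
                                    {
                                        "name": column,
                                        "value": f"this | {ner_function}",
                                    }
                                ]
                            }
                        }
                    ],
                }
            }
        ]
    }


# Hypothetical entity names and column name for a healthcare dataset.
config = build_ner_config(
    ["medical_record_number", "blood_type"],
    column="notes",
)
print(json.dumps(config, indent=2))
```

Keeping the entity list and the transformation expression in separate places means the same entity list can be reused while switching between, say, `fake_entities` and `redact_entities`.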
Ready to protect your data?
You can access Transform v2 NER in the Gretel Console or using our SDK. If you are looking for inspiration, check out our notebooks demonstrating basic SDK usage for PII redaction, and how to extract content and redact PII from PDF files to prepare them for a downstream RAG application.