Detect and redact PII in free text with NER in Transform v2

Ensure data quality and privacy by applying flexible de-identification and rule-based transformations to real and synthetic datasets.
Copyright 2024, Gretel.ai

Whether sharing data internally, training large language models (LLMs), or building vector stores for retrieval augmented generation (RAG), organizations constantly juggle the need to protect personally identifiable information (PII) against the demand to extract valuable insights.

With unstructured data such as call center transcripts or clinical notes, adequately protecting PII requires both sophisticated models that label entities based on their values and the surrounding context, and flexible transformations that remove PII entities while preserving as much of the text's utility as possible for downstream applications.

Transform v2 is a powerful tool that simplifies the process of detecting and redacting PII in tabular and free text formats using Named Entity Recognition (NER). With this update, you can now:

  1. Label custom PII entity types by simply listing them in the config, without the need for additional training.
  2. Utilize four robust NER functions to remove PII entities while optimizing the de-identified text for your specific use case.

Meet the four NER functions

Below is a brief description of the new NER functions you can use. For a deeper dive, check out the Transform v2 docs.

redact_entities

Replace each detected entity with its entity type. For example, "Sally" becomes <first_name>. This function is ideal for anonymizing data while retaining semantic meaning.
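
To make the behavior concrete, here is a minimal Python sketch of type-based redaction. It is not Gretel's implementation: the redact helper, the city entity type, and the hard-coded (start, end, entity_type) spans are assumptions standing in for the output of an NER model.

```python
def redact(text, spans):
    """Replace each (start, end, entity_type) span with <entity_type>.

    `spans` is a hypothetical stand-in for NER model output; spans are
    assumed not to overlap.
    """
    pieces = []
    cursor = 0
    for start, end, entity_type in sorted(spans):
        pieces.append(text[cursor:start])   # keep the text before the entity
        pieces.append(f"<{entity_type}>")   # swap the entity for its type
        cursor = end
    pieces.append(text[cursor:])            # keep the trailing text
    return "".join(pieces)


text = "Sally called from Boston."
spans = [(0, 5, "first_name"), (18, 24, "city")]
redacted = redact(text, spans)
print(redacted)
```

Because the surrounding words are untouched, a reader or a downstream model can still tell that a person called from a place, even though the specific values are gone.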

hash_entities

Replace detected entities with a unique salted hash. For instance, "Sally" could turn into 515acf74f. This ensures privacy while allowing for consistent pseudonymization.
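
The sketch below illustrates why a salted hash gives consistent pseudonyms. It is an illustration only: the use of SHA-256, the 9-character truncation, and the hash_entity helper are assumptions, not a description of how Transform v2 computes its hashes.

```python
import hashlib


def hash_entity(value, salt, length=9):
    """Return a short, salted pseudonym for an entity value (illustrative)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:length]


salt = "example-salt"  # hypothetical salt; keep a real one secret

first = hash_entity("Sally", salt)
second = hash_entity("Sally", salt)
other_salt = hash_entity("Sally", "another-salt")

# The same value and salt always map to the same pseudonym, so records
# that mention "Sally" can still be linked after de-identification.
print(first == second)
# A different salt yields an unrelated pseudonym for the same value.
print(first == other_salt)
```

The salt matters because an unsalted hash of a common name could be reversed by hashing a list of candidate names and comparing the results.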

fake_entities

Replace PII entities with random, fake values of the same type. For example, "Sally" may become "Joe". You can also customize the fallback behavior (which defaults to redaction) for entity types not supported by the Faker library. For example, to hash entities in the current column that Faker does not support instead of redacting them, you could use: this | fake_entities(on_error="hash").
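
The fallback logic can be pictured with a small Python sketch. It does not call Faker or Gretel: the FAKE_VALUES table is a hypothetical stand-in for the entity types a faker supports, and fake_entity is an illustrative helper that only models the two behaviors mentioned here (redaction by default, hashing with on_error="hash").

```python
import hashlib
import random

# Hypothetical stand-in for the entity types a faker library can generate.
FAKE_VALUES = {"first_name": ["Joe", "Maria", "Chen"]}


def fake_entity(value, entity_type, on_error="redact"):
    """Replace a value with a fake one, falling back when the type is unsupported."""
    candidates = FAKE_VALUES.get(entity_type)
    if candidates:
        return random.choice(candidates)
    if on_error == "hash":
        # Unsupported type: fall back to a short hash of the value.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()[:9]
    # Default fallback: redact to the entity type.
    return f"<{entity_type}>"


print(fake_entity("Sally", "first_name"))               # a random supported fake
print(fake_entity("O+", "blood_type"))                  # falls back to redaction
print(fake_entity("O+", "blood_type", on_error="hash")) # falls back to hashing
```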

label_entities

Annotate your data with both the entity type and the value. For instance, "Sally" becomes <entity type="first_name" value="Sally">. This is particularly useful for downstream processing, such as generating detailed HTML reports. For example, we used displaCy to color detected entities (including custom entities such as invoice_number, tax_registration_number, and product_name) in a PDF invoice.
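
As a rough idea of what downstream processing might look like, the Python sketch below pulls entities back out of labeled text with a regular expression. It assumes every label follows exactly the <entity type="..." value="..."> shape shown above and that values contain no double quotes; the sample sentence, the invoice number, and the helper names are hypothetical.

```python
import re

# Matches labels of the assumed form <entity type="..." value="...">.
ENTITY_TAG = re.compile(r'<entity type="([^"]+)" value="([^"]*)">')


def extract_entities(labeled_text):
    """Return (entity_type, value) pairs in order of appearance."""
    return ENTITY_TAG.findall(labeled_text)


def strip_labels(labeled_text):
    """Recover the plain text by replacing each label with its value."""
    return ENTITY_TAG.sub(lambda match: match.group(2), labeled_text)


labeled = (
    'Invoice <entity type="invoice_number" value="INV-1042"> was sent to '
    '<entity type="first_name" value="Sally">.'
)

entities = extract_entities(labeled)
plain = strip_labels(labeled)
print(entities)
print(plain)
```

The extracted pairs could then feed a report, a visualizer, or a review queue, while the original text remains recoverable from the labeled version.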

Use our default entity list or create your own

As with other Transform v2 features, the new NER functions are highly versatile. Unlike legacy systems, Transform v2 NER does not confine you to a predefined set of entities. You can list any entities you need to detect in the Transform v2 config, and the model will interpret their semantic meaning and attempt to detect them without requiring prior training, although results may vary. For example, a healthcare organization could use the Transform v2 config below to detect and replace non-default entity types such as medical record numbers and blood types:

schema_version: "1.0"
name: "redact-pii-nlp"
models:
  - transform_v2:
      data_source: "_"
      globals:
        classify:
          entities:
            - name
            - ssn
            - medical_record_number
            - blood_type
      steps:
        - rows:
            update:
              - type: text
                value: this | hash_entities
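
If you generate configs for several entity lists or NER functions, a small helper can keep them consistent. The Python sketch below mirrors the structure of the config above as a plain dictionary; build_config is a hypothetical helper, not part of the Gretel SDK, and serializing the result to YAML is left to whichever YAML library you already use.

```python
def build_config(entities, function="hash_entities", name="redact-pii-nlp"):
    """Build a dict with the same shape as the Transform v2 config above."""
    return {
        "schema_version": "1.0",
        "name": name,
        "models": [
            {
                "transform_v2": {
                    "data_source": "_",
                    "globals": {"classify": {"entities": list(entities)}},
                    "steps": [
                        {
                            "rows": {
                                "update": [
                                    # Apply the chosen NER function to text columns.
                                    {"type": "text", "value": f"this | {function}"}
                                ]
                            }
                        }
                    ],
                }
            }
        ],
    }


healthcare = build_config(["name", "ssn", "medical_record_number", "blood_type"])
redacting = build_config(["name", "ssn"], function="redact_entities")
print(healthcare["models"][0]["transform_v2"]["steps"][0]["rows"]["update"][0])
```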

Ready to protect your data?

You can access Transform v2 NER in the Gretel Console or using our SDK. If you are looking for inspiration, check out our notebooks demonstrating basic SDK usage for PII redaction, and how to extract content and redact PII from PDF files to prepare them for a downstream RAG application.