Introducing Gretel's Transform v2

Leverage Gretel’s New Ultra-Fast and Fully Flexible De-Identification and Rule-Based Transformation Solution for HIPAA Compliance.

TL;DR: Gretel Transform v2 delivers 10x faster processing, support for arbitrary PII entities, and fully flexible transformation rules. See how you can use Transform v2 to de-identify a healthcare dataset for HIPAA compliance in a few clicks.

To date, the Gretel community has taken advantage of our Transform model to de-identify sensitive personally identifiable information (PII) entities before training synthetic data models. As usage of our synthetic data platform grows, we're hearing firsthand that redacting common PII is only part of the data preparation challenge. Our users also need to ensure that the transformed and synthesized datasets adhere to their unique business rules. Examples include de-identifying PII fields that are uncommon or unique to their organizations (such as gene sequences, product IDs, etc.), or ensuring that generated data meets format or business constraints (for example, email addresses should follow the format firstname.lastname@company.country, or assets must be equal to liabilities).

With the beta release of Transform v2, you now have the flexibility to easily detect arbitrary entities specific to your business, and to define custom transformation rules that generate data matching the exact format and logic you need. This release not only extends Transform's already robust de-identification capabilities, but also makes Transform v2 flexible enough to reach beyond de-identification and cover most dataset pre- and post-processing use cases.

The cherry on top? Transform v2 is an order of magnitude faster than v1 (100MB/minute in our internal load test), has a more succinct and intuitive configuration format, and fully integrates with the Gretel ecosystem, including synthetic data models, connectors, and workflows. For example, you can now build Gretel Workflows that read data directly from your relational database, pre-process it (e.g., clean or normalize it) using Transform v2, train and generate synthetic data using ACTGAN or any other Gretel model, then validate the output (e.g., drop records that don't meet business rules) using Transform v2 before writing it to your data warehouse, all with a few lines of YAML configuration.
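To make that pipeline concrete, here is a rough YAML sketch of such a workflow. The action types, connection IDs, and field names below are illustrative assumptions, not exact Gretel Workflows syntax; consult the Workflows documentation for the real schema:

```yaml
# Hypothetical sketch of a Gretel Workflow chaining the steps described above.
# Action types and connection IDs are illustrative placeholders.
name: deidentify-and-synthesize
actions:
  - name: read-patients
    type: mysql_source            # read directly from a relational database
    connection: conn_hospital_db
  - name: pre-process
    type: gretel_model
    input: read-patients
    config:
      model: transform_v2         # clean and de-identify with Transform v2
  - name: synthesize
    type: gretel_model
    input: pre-process
    config:
      model: actgan               # train and generate synthetic data
  - name: validate
    type: gretel_model
    input: synthesize
    config:
      model: transform_v2         # drop records that break business rules
  - name: write-warehouse
    type: snowflake_destination   # write results to the data warehouse
    input: validate
    connection: conn_warehouse
```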

Ready to dive in? Given Gretel's extensive work with medical researchers and practitioners, let's consider a public health example: Imagine a scenario where a hospital aims to share its patient records with a research institute to facilitate a study on disease patterns. To comply with HIPAA (Health Insurance Portability and Accountability Act) regulations, the hospital needs to ensure that PHI (Protected Health Information) is sufficiently anonymized to prevent any possible linkage back to individual patients.

Step 1: Preparing the Dataset

While the hospital may have its patient record dataset readily available, for the purposes of this blog we generate a synthetic patient record dataset from scratch by prompting Gretel's Tabular LLM to produce a dataset containing most of the identifier types covered under HIPAA. Below are the first three records of that dataset:

Figure 1. Example of some records from our health dataset.

Step 2: Building the Configuration

As with other Gretel models, our next task is to construct a YAML configuration file that specifies HIPAA-related de-identification rules. To keep our configuration file readable, we write a separate rule for each of the 18 identifiers explicitly listed under HIPAA's Safe Harbor de-identification method. Our first attempt is below:

# Transform v2 config template for de-identifying datasets for HIPAA compliance based on the safe harbor method:
# https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html#coveredentities
schema_version: "1.0"
models:
  - transform_v2:
      globals:
        classify:
          enable: true
          entities: []
          auto_add_entities: true
        locales: [en_US]
      steps:
        - rows:
            update:
              # (A) Names
              - entity: [name, first_name, last_name]
                value: column.entity | fake
              # (B) Geographic subdivisions smaller than a state
              - entity: [address, street_address, city, postalcode]
                value: column.entity | fake
              # (C) All elements of dates except year, and ages over 89
              - entity: date
                value: this | date_shift("-1y", "+1y")
              # (D) Telephone numbers
              - entity: phone_number
                value: fake.phone_number()
              # (E) Fax numbers
              - entity: fax_number
                value: fake.phone_number()
              # (F) Email addresses
              - entity: email
                value: fake.ascii_email()
              # (G) Social security numbers
              - entity: ssn
                value: fake.ssn()
              # (H) Medical record numbers
              - entity: medical_record_number
                value: fake.random_number(digits=10, fix_len=True)
              # (I) Health plan beneficiary numbers
              - entity: health_plan_beneficiary_number
                value: fake.random_number(digits=10, fix_len=True)
              # (J) Account numbers
              - entity: account_number
                value: fake.bban()
              # (K) Certificate/license numbers
              - entity: certificate_number
                value: fake.bothify(text="??-########")
              # (L) Vehicle identifiers and serial numbers, including license plate numbers
              - entity: vehicle_identifier
                value: fake.vin()
              - entity: license_plate
                value: fake.license_plate()
              # (M) Device identifiers and serial numbers
              - entity: device_serial_number
                value: fake.hexify(text="^^^^-^^^^-^^^^-^^^^", upper=False)
              # (N) Web Universal Resource Locators (URLs)
              - entity: url
                value: fake.url()
              # (O) Internet Protocol (IP) addresses
              - entity: ip_address
                value: fake.ipv4()
              # (P) Biometric identifiers, including finger and voice prints
              - entity: [biometric_identifier, fingerprint, voiceprint]
                value: '"<" ~ column.entity ~ ">"'
              # (Q) Full-face photographs and any comparable images
              - entity: image
                value: fake.image_url()
              # (R) Any other unique identifying number, characteristic, or code
              - entity: unique_identifier
                value: fake.uuid4()

Transform v2 uses the column names and a few example values to match each column to its most likely entity type, if any. We set the auto_add_entities parameter to true to instruct Transform v2 to go beyond the default entities and attempt to detect any custom entity type listed in a row update rule.

We run this config successfully on our dataset using the Transform v2 Blueprint in the Gretel Console, then check the Transform v2 report and confirm that the correct entity was detected for each column:

Patient: name
Location: address
DOB: date
PhoneNo: phone_number
VIN: vehicle_identifier
Fax: fax_number
SerialNo: device_serial_number
Email: email
URL: url
SSN: ssn
IP: ip_address
MRN: medical_record_number
HPBN: health_plan_beneficiary_number
AN: account_number
LN: license_plate
Image: image

All PII columns were detected correctly! However, upon inspecting the output, we notice that the format does not perfectly match the input data. For example, unlike the input data, generated phone numbers contain extensions; email addresses and URLs do not match patient names; the numeric portions of medical record numbers and health plan beneficiary numbers are not equal; and those two fields, along with account numbers, lack the correct prefixes.

Figure 2. Example of our synthetically generated records.

To fix these formatting mismatches, we override the relevant rules with custom value expressions that reproduce the input formats:
- entity: [phone_number, fax_number]
  value: fake.numerify("%##-%##-####")
- entity: email
  value: row.Patient.replace(" ", ".") | normalize(".") | lower ~ "@" ~ fake.domain_name()
- entity: url
  value: '"linkedin.com/in/" ~ row.Patient | normalize | lower'
- entity: medical_record_number
  value: fake.numerify("MR######")
- entity: health_plan_beneficiary_number
  value: '"HPN" ~ row.MRN[2:]'
- entity: account_number
  value: fake.bothify("?######") | upper
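Under the hood, Faker-style helpers such as numerify and bothify are simple pattern fillers. The stdlib-only approximation below illustrates the idea; it is not Gretel's or Faker's actual implementation:

```python
import random
import string

def numerify(pattern: str) -> str:
    """Replace '#' with a random digit and '%' with a non-zero digit,
    mimicking Faker's numerify (illustrative sketch only)."""
    out = []
    for ch in pattern:
        if ch == "#":
            out.append(random.choice(string.digits))
        elif ch == "%":
            out.append(random.choice("123456789"))
        else:
            out.append(ch)
    return "".join(out)

def bothify(pattern: str) -> str:
    """Like numerify, but also replace '?' with a random ASCII letter."""
    return "".join(
        random.choice(string.ascii_letters) if ch == "?" else ch
        for ch in numerify(pattern)
    )

# Fake values in the formats used by the rules above:
print(numerify("%##-%##-####"))    # e.g. 555-123-4567
print(bothify("?######").upper())  # e.g. A123456
```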

Success! We now have a transformed dataset where all entities are correctly detected and replaced with fake values in our desired format.

Figure 3. Example of our successfully transformed records.

Step 3: Operationalization

While the above config works well as a one-off, in practice we recommend combining Transform v2 with differentially private synthetic data generation to meet the highest standards for data anonymization. In a future blog, we will build a scheduled Gretel workflow to automatically transform and synthesize all of our hospital's existing and future patient records in line with HIPAA regulations.

Ready to customize your datasets?

Experience Transform v2’s speed and flexibility in a few clicks by starting with the Console blueprint. Or read more in the Transform v2 docs, where you can find starter configs for various transformation use cases on the Examples page, then dive into the comprehensive feature list and config syntax on the Reference page.

Transform v2 is currently in beta, and we're inviting you to join the forefront of data privacy and customization by becoming a beta tester and sharing your feedback with us. As always, we're here to support your journey, so please reach out on our Discord to share any questions or feedback. Happy transforming!