Introducing Gretel's Transform v2
TL;DR: Gretel Transform is now 10x faster, supports arbitrary PII entities, and offers fully flexible transformation rules. See how you can use Transform v2 to de-identify a HIPAA dataset in a few clicks.
To date, the Gretel community has taken advantage of our Transform model to de-identify sensitive personally identifiable information (PII) entities before training synthetic data models. As usage of our synthetic data platform grows, we're hearing firsthand that redacting common PII is only part of the data preparation challenge. Our users also need to ensure that the transformed and synthesized datasets adhere to their unique business rules. Examples include de-identifying PII fields that are uncommon or unique to their organizations (such as gene sequences, product IDs, etc.), or ensuring that generated data meets format or business constraints (for example, email addresses should follow the format firstname.lastname@example.org, or assets must be equal to liabilities).
With the beta release of Transform v2, you can now detect arbitrary entities specific to your business and define custom transformation rules that generate data matching the exact format and logic you need. This extends Transform’s already robust de-identification capabilities, and it also makes Transform v2 adaptable enough to cover most dataset pre- and post-processing use cases beyond de-identification.
The cherry on top? Transform v2 is an order of magnitude faster than v1 (100MB/minute in our internal load test), has a more succinct and intuitive configuration format, and fully integrates with the Gretel ecosystem, including synthetic data models, connectors, and workflows. For example, you can now build Gretel Workflows that read data directly from your relational database, pre-process it (e.g. clean or normalize it) using Transform v2, train and generate synthetic data using ACTGAN or any other Gretel model, then validate the output (e.g. drop records that don’t meet business rules) using Transform v2 before writing it to your data warehouse, all with a few lines of YAML configuration.
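To make the validation idea concrete, here is a minimal sketch in plain Python of dropping records that break the two example business rules mentioned earlier (the email format and assets equal to liabilities). It is not Transform v2 configuration syntax; the function names, the field names, and the sample records are all hypothetical and exist only for illustration.

```python
import re

# Hypothetical rule: emails must look like firstname.lastname@example.org
EMAIL_RULE = re.compile(r"^[a-z]+\.[a-z]+@example\.org$")


def meets_business_rules(record: dict) -> bool:
    """Return True if the record satisfies both illustrative constraints."""
    email_ok = EMAIL_RULE.fullmatch(record["email"]) is not None
    balanced = record["assets"] == record["liabilities"]
    return email_ok and balanced


def drop_invalid(records: list) -> list:
    """Keep only the records that pass every rule."""
    return [r for r in records if meets_business_rules(r)]


# Made-up sample records, not taken from any real dataset.
records = [
    {"email": "jane.doe@example.org", "assets": 100, "liabilities": 100},
    {"email": "jdoe@example.org", "assets": 100, "liabilities": 100},
    {"email": "john.smith@example.org", "assets": 250, "liabilities": 200},
]
valid = drop_invalid(records)
```

Only the first record survives: the second fails the email format and the third fails the balance check.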
Ready to dive in? Given Gretel’s extensive work with medical researchers and practitioners, let’s consider a public health example. Imagine a hospital that wants to share its patient records with a research institute to facilitate a study on disease patterns. To comply with HIPAA (Health Insurance Portability and Accountability Act) regulations, the hospital needs to ensure that PHI (Protected Health Information) is sufficiently anonymized to prevent linkage back to individual patients.
Step 1: Preparing the Dataset
While the hospital may have its patient record dataset readily available, for the purposes of this blog post we generate a synthetic patient record dataset from scratch by prompting Gretel’s Tabular LLM to generate a dataset containing most of the identifier types covered by HIPAA. Below are the first 3 records of that dataset:
Step 2: Building the Configuration
As with other Gretel models, our next task is to construct a YAML configuration file that specifies HIPAA-related de-identification rules. To keep our configuration file readable, we write a separate rule for each of the 18 identifier types explicitly listed under HIPAA regulations. Our first attempt is below:
Transform v2 uses the column names and a few example values to match each column to the most likely entity type, if any. We set the 'auto_add_entities' parameter to 'true' to instruct Transform v2 to go beyond the default entities and attempt to detect any custom entity type listed in a row update rule.
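For intuition only, the idea of matching a column to an entity from its name plus a few example values can be sketched in plain Python as below. This is a toy heuristic and not how Transform v2 classifies columns internally; the entity list, the regular expressions, the equal weighting, and the 0.5 threshold are all assumptions made for illustration.

```python
import re

# Hypothetical entity definitions: (hint for the column name, pattern for values).
ENTITY_RULES = {
    "email": (re.compile(r"e-?mail", re.I),
              re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
    "phone_number": (re.compile(r"phone|tel", re.I),
                     re.compile(r"^\+?[\d\s().-]{7,}$")),
    "ssn": (re.compile(r"ssn|social", re.I),
            re.compile(r"^\d{3}-\d{2}-\d{4}$")),
}


def classify_column(name, sample_values):
    """Return the best-scoring entity for a column, or None if nothing fits."""
    best_entity, best_score = None, 0.0
    for entity, (name_hint, value_pattern) in ENTITY_RULES.items():
        # Evidence 1: does the column name hint at this entity?
        name_score = 1.0 if name_hint.search(name) else 0.0
        # Evidence 2: what fraction of the sample values fit the pattern?
        matches = sum(1 for v in sample_values if value_pattern.fullmatch(str(v)))
        value_score = matches / len(sample_values) if sample_values else 0.0
        score = 0.5 * name_score + 0.5 * value_score
        if score > best_score:
            best_entity, best_score = entity, score
    # Require at least one full source of evidence before assigning an entity.
    return best_entity if best_score >= 0.5 else None
```

A column named `patient_email` holding address-like values would be labeled `email`, while a free-text `notes` column would come back as `None` and be left untouched.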
We run this config successfully on our dataset using the Transform v2 Blueprint in the Gretel Console, then check the Transform v2 report and confirm that the correct entity was detected for each column:
All PII columns were detected correctly! However, on inspecting the output, we notice that its format does not perfectly match the input data. Unlike the input data, the generated phone numbers contain extensions, the email addresses and URLs do not match patient names, the numeric portions of the medical record numbers and health plan beneficiary numbers are not equal, and those two fields, along with account numbers, lack the correct prefixes.
Success! We now have a transformed dataset where all entities are correctly detected and replaced with fake values in our desired format.
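The format constraints described above (emails and URLs derived from the patient name, a shared numeric portion across identifiers, fixed prefixes, and phone numbers without extensions) can be illustrated with a short plain-Python sketch. It is not the Transform v2 configuration used in this walkthrough; the function name, the `MRN-` and `HPBN-` prefixes, the `example.org` domain, and the phone layout are hypothetical placeholders.

```python
import random


def fake_patient_identifiers(first_name, last_name, rng):
    """Build replacement values that stay consistent with one another."""
    # One shared number keeps both identifiers' numeric portions equal.
    number = rng.randint(100000, 999999)
    # Email and URL are derived from the (already replaced) patient name.
    slug = f"{first_name}.{last_name}".lower()
    return {
        "email": f"{slug}@example.org",
        "url": f"https://example.org/{slug}",
        "medical_record_number": f"MRN-{number}",            # hypothetical prefix
        "health_plan_beneficiary_number": f"HPBN-{number}",  # hypothetical prefix
        # Fixed layout with no extension part.
        "phone_number": f"555-{rng.randint(100, 999)}-{rng.randint(1000, 9999)}",
    }


rng = random.Random(42)  # seeded so repeated runs give the same values
row = fake_patient_identifiers("Jane", "Doe", rng)
```

Deriving related fields from a single source value, instead of faking each column independently, is what keeps the output internally consistent.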
Step 3: Operationalization
While the above config works well for a one-off run, in practice we recommend combining Transform v2 with differentially private synthetic data generation to meet the highest standards for data anonymization. In a future blog post, we will build a scheduled Gretel workflow to automatically transform and synthesize all of our hospital’s existing and future patient records in line with HIPAA regulations.
Ready to customize your datasets?
Experience Transform v2’s speed and flexibility in a few clicks by starting with the Console blueprint. Or read more in the Transform v2 docs, where you can find starter configs for various transformation use cases on the Examples page, then dive into the comprehensive feature list and config syntax on the Reference page.
Transform v2 is currently in beta, and we're inviting you to join the forefront of data privacy and customization by becoming a beta tester and sharing your feedback with us. As always, we're here to support your journey, so please reach out on our Discord to share any questions or feedback. Happy transforming!