Auto-anonymize production datasets for development
In this post, we walk through building a data pipeline that will automatically transform datasets so they can be safely used in development environments.
Developers want to work with data that closely mirrors what’s in production. This makes it easier to catch bugs during testing, and it gives teams confidence that each release remains stable.
A common practice is to replicate either a subset or the entirety of a production database into a pre-production environment such as staging. This maximizes data parity between the two environments, but introduces challenges when ensuring customer privacy isn’t compromised.
To prevent customer details from leaking into the dataset, a developer might write a pipeline that defines specific anonymizing transformations per table and per column. While this process is effective, there’s friction. It requires a deep understanding of the underlying dataset and its statistical properties. It requires an understanding of various anonymization techniques. It requires ongoing maintenance as the upstream schemas evolve. If these factors aren’t properly managed, there’s a risk that customer details will leak.
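To make that maintenance burden concrete, here is a minimal sketch of what such a hand-written pipeline might look like in Python. Everything in it is an assumption made for illustration: the `orders` table, the column names, and the helpers `fake_name`, `redact` and `anonymize_row` are hypothetical and are not part of the blueprint or of Gretel’s SDKs.

```python
import hashlib


def fake_name(value):
    """Replace a name with a stable placeholder derived from a hash.

    An unsalted hash is only a stand-in here; it is not a strong
    anonymization technique on its own.
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
    return f"Customer-{digest}"


def redact(value):
    """Replace the value with a fixed marker."""
    return "REDACTED"


# Hypothetical, hand-maintained mapping: table -> column -> transformation.
TRANSFORMS = {
    "orders": {
        "full_name": fake_name,
        "address": redact,
    },
}


def anonymize_row(table, row):
    """Apply the configured transformation to each known column.

    Columns that are missing from TRANSFORMS pass through unchanged.
    """
    column_transforms = TRANSFORMS.get(table, {})
    return {
        column: column_transforms[column](value) if column in column_transforms else value
        for column, value in row.items()
    }


# "email" stands in for a column added upstream after the mapping was written.
row = {
    "order_id": 1001,
    "full_name": "Jane Doe",
    "address": "1 Main St",
    "email": "jane@example.com",
}
anonymized = anonymize_row("orders", row)
print(anonymized)
```

Note that the `email` value passes through untouched because nobody added it to the mapping. This is the kind of leak that schema drift can cause in a manually maintained pipeline.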
In this blog post, we walk through an open-source code blueprint that details how to build a data pipeline to auto-anonymize streaming data using Gretel.ai’s SDKs. You can adapt the blueprint to your own anonymization project, making the process automatic while reducing risk.
In this blueprint we’ll be anonymizing a dataset containing customer bike orders with full names and addresses.
1. Sign in to the Gretel console with your Gmail or GitHub account at https://console.gretel.cloud.
2. Create a new data project. Name the project whatever you like.
3. From the “More” menu next to “Upload Files”, select “Load test dataset”, select “customer-bike-orders” and then “Continue”.
4. After the dataset has loaded, click the “Integration” tab, and copy the project URI to the clipboard. You will need this to allow the Python notebook to connect to the “Bike Orders” project in Gretel.
5. Click “Transform”, and then launch the “Auto-anonymize production datasets for development” blueprint. This will launch the notebook in Google Colab’s free notebook environment. Alternatively, you can download the notebook from Colab and run it in your own environment.
6. Click “Run all” to run the blueprint code, which uses the project URI you copied in step 4 to access your Gretel project, generate a ruleset for anonymization, and automatically anonymize the data. Details below!
An overview of the pipeline
Let’s start with some code. The snippet below is adapted from the blueprint notebook and represents an end-to-end pipeline. We start with a sensitive dataset represented as a stream of customer bike orders, sample_records, and produce an anonymized copy, xf_records.
Next, we’ll break the code snippet down into three core components and dive into the details.
Labeling and Discovery
Using send_batch, inbound source data is passed through Gretel’s Data Cataloging service where entities such as names, emails and addresses are identified. Gretel will also calculate various statistical properties of the dataset such as cardinality, uniqueness and distribution. These properties will help determine an optimal transformation pipeline.
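To give a feel for the statistical properties mentioned above, here is a small standard-library sketch that computes them for a list of records. It is not Gretel’s implementation: the function name `field_stats` and the definition of uniqueness as distinct values divided by total values are our own assumptions, and entity detection is left out entirely.

```python
from collections import Counter


def field_stats(records):
    """Compute cardinality, uniqueness and value distribution per field."""
    values_by_field = {}
    for record in records:
        for field, value in record.items():
            values_by_field.setdefault(field, []).append(value)

    stats = {}
    for field, values in values_by_field.items():
        counts = Counter(values)
        stats[field] = {
            # Number of distinct values seen in the field.
            "cardinality": len(counts),
            # Assumed definition: distinct values / total values.
            "uniqueness": len(counts) / len(values),
            # How often each value occurs.
            "distribution": dict(counts),
        }
    return stats


# Made-up records, loosely modeled on the bike orders dataset.
sample_records = [
    {"name": "Jane Doe", "bike": "road"},
    {"name": "John Roe", "bike": "road"},
    {"name": "Ann Poe", "bike": "gravel"},
    {"name": "Sam Low", "bike": "road"},
]
stats = field_stats(sample_records)
print(stats["bike"])
```

A field like `name`, where every value is distinct, looks very different from a low-cardinality field like `bike`, and that difference is the kind of signal a rules engine can use when choosing a transformation.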
At the heart of the pipeline is a rules engine. build_pipeline accepts a Gretel project and streams down the enriched source dataset for evaluation against a managed set of rules. These rules determine what transformations are applied to the dataset.
For example, if a field is identified as containing a person’s name, a rule will be matched that automatically replaces the contents of that field in each record with a fake name.
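The idea of matching rules against field metadata can be sketched in a few lines of Python. This is an illustration only, not Gretel’s rules engine: the `FieldMeta` and `Rule` classes, the entity labels, the rule names and the first-match-wins strategy are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FieldMeta:
    name: str
    entities: set       # entity labels detected in the field (hypothetical labels)
    uniqueness: float   # distinct values / total values


@dataclass
class Rule:
    name: str
    matches: Callable[[FieldMeta], bool]
    transform: str      # label of the transformation to apply


# Hypothetical ruleset, evaluated in order.
RULES = [
    Rule("person_name", lambda f: "person_name" in f.entities, "fake_name"),
    Rule("email", lambda f: "email_address" in f.entities, "fake_email"),
    Rule("unique_identifier", lambda f: f.uniqueness == 1.0, "drop"),
]


def evaluate(fields):
    """Return {field name: transform label} using the first matching rule."""
    matched = {}
    for field in fields:
        for rule in RULES:
            if rule.matches(field):
                matched[field.name] = rule.transform
                break
    return matched


fields = [
    FieldMeta("full_name", {"person_name"}, 1.0),
    FieldMeta("order_id", set(), 1.0),
    FieldMeta("bike", set(), 0.5),
]
plan = evaluate(fields)
print(plan)
```

Here `full_name` matches the name rule before the identifier rule is considered, `order_id` falls through to the identifier rule, and `bike` matches nothing and would be left as-is.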
Gretel ships a convenience function in the blueprint notebook, rule_inspector, that helps us evaluate the results of a run.
Looking at the screenshot we see all the fields, rules and transformations that the pipeline was able to detect. Applying these rules will give us an anonymized dataset.
Note: In this demonstration, we run the rules engine against a batched set of “training” data. Depending on the use case, the engine can be configured to run “online” against streaming data.
After a set of rules has been matched, we generate a transformation pipeline from those rules using Gretel’s transformer package. With the pipeline now built, pipeline.xf_project will process each project record through the transformation pipeline for anonymization.
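As a rough mental model of this step, the sketch below turns a set of matched rules into per-field callables and applies them lazily to a stream of records. It does not use Gretel’s transformer package: the names `TRANSFORMERS`, `build_transforms` and `transform_records`, the shape of `matched_rules`, and the fixed replacement values are all assumptions made for the example.

```python
# Hypothetical registry of transformations, keyed by label.
TRANSFORMERS = {
    "fake_name": lambda value: "Fake Person",
    "redact": lambda value: "REDACTED",
}


def build_transforms(matched_rules):
    """Map each field to the callable named by its matched rule."""
    return {field: TRANSFORMERS[label] for field, label in matched_rules.items()}


def transform_records(records, field_transforms):
    """Lazily yield a transformed copy of each record.

    Fields without a transformation are copied unchanged, and the
    source records are never modified.
    """
    for record in records:
        yield {
            field: field_transforms[field](value) if field in field_transforms else value
            for field, value in record.items()
        }


# Assumed output of a rules engine: field name -> transformation label.
matched_rules = {"full_name": "fake_name", "address": "redact"}
sample_records = [
    {"order_id": 1, "full_name": "Jane Doe", "address": "1 Main St"},
]
xf_records = list(transform_records(sample_records, build_transforms(matched_rules)))
print(xf_records)
```

Using a generator keeps memory use flat for large or streaming inputs, since each record is transformed only when the consumer asks for it.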
Referring back to the code snippet, xf_records now contains an anonymized copy of our source dataset. Using another helper function from the blueprint notebook, df_diff, we can perform a row-wise comparison between an original and transformed record.
Notice the “Transformed” column contains an anonymized version of the record. These transformed records are now ready to be pushed into a downstream pre-production environment without risk of leaking customer details.
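If you want a similar row-wise comparison outside the notebook, it can be approximated with plain Python. This is not the blueprint’s df_diff helper, whose implementation is not shown here. The function `record_diff`, its tuple layout and the sample values are hypothetical.

```python
def record_diff(original, transformed):
    """Return one row per field: (field, original, transformed, changed)."""
    fields = sorted(set(original) | set(transformed))
    return [
        (
            field,
            original.get(field),
            transformed.get(field),
            original.get(field) != transformed.get(field),
        )
        for field in fields
    ]


# Made-up original and transformed versions of one record.
original = {"order_id": 1, "full_name": "Jane Doe", "bike": "road"}
transformed = {"order_id": 1, "full_name": "Fake Person", "bike": "road"}

rows = record_diff(original, transformed)
for field, before, after, changed in rows:
    marker = "*" if changed else " "
    print(f"{marker} {field}: {before!r} -> {after!r}")
```

Flagging the changed fields makes it easy to spot a sensitive column that slipped through without a transformation.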
Hopefully, after reading this, you’ve seen how it’s possible to take a dataset with an unknown schema and combine several Gretel services to create an anonymized version of the dataset. Thanks for following along!
Interested in building your own pipeline for anonymizing datasets? Sign up for a free Gretel account to access our premium SDKs and get started in seconds with our blueprint, Auto-anonymize production datasets for development.