Anonymize Data with S3 Object Lambda
Anonymize data at access time with Gretel and Amazon S3 Object Lambda.
AWS recently announced S3 Object Lambda, and it opens up a wide range of use cases. The first thing we thought of was using Gretel Transforms to anonymize S3 object data in real time as it's requested, without having to store a duplicate, transformed copy of the data. We've also been itching to test out container support for AWS Lambda, so putting together an example stack that does all of these things felt like a great project to work on.
In this blog we use the following stack to launch a real-time S3 object transformation service. By the end of this blog, you will be able to launch an end-to-end service that anonymizes data in S3 in real time and is exposed through a REST API.
This GitHub Repo contains all the necessary ingredients to cook up some transforms with S3 Object Lambda along with the Serverless configuration to launch our entire AWS CloudFormation stack.
If you want the TL;DR, you can run this code:
Let's dive in...
For this demo, let's assume you have an S3 bucket with structured data that ideally you would want to make available to other teams, organizations, or customers. In order to share this information, you need to redact or transform some of the content to preserve privacy.
Additionally, your downstream consumers may not need to access every object, but instead selectively choose an object to retrieve. Let's assume objects are named by date, "YYYY-MM-DD.csv", and a user only occasionally wants to fetch the data for a specific date.
One of the leading strategies to support this is "transform-on-write", meaning that when an S3 object is first created, some workflow executes that creates the transformed version of the object and stores it separately in S3.
At this point, you can generate a pre-signed S3 URL to give access to the transformed object.
One drawback to this strategy is that for every object you store, a transformed version must also be stored. If the transformed objects are accessed only rarely, you still pay the storage and transform compute costs for every one of them.
With the new S3 Object Lambda feature, users can request access to S3 objects that are transformed in real-time as they are returned to the client. Super cool!
Stepping through the AWS blog post, we see the API call to download the object from the Object Lambda Access Point:
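As a minimal sketch of what such a call looks like with boto3: the Object Lambda Access Point ARN is passed where a bucket name normally goes. The ARN, the object key, and the helper names below are made up for illustration, and the client is passed in so the helper can be tried with a stub.

```python
# Hypothetical ARN for illustration; substitute your own Object Lambda Access Point.
ACCESS_POINT_ARN = (
    "arn:aws:s3-object-lambda:us-east-1:123456789012:accesspoint/example-olap"
)


def make_s3_client():
    """Create an S3 client; Object Lambda calls need boto3 1.17 or newer."""
    import boto3  # imported lazily so the helper below can be tried with a stub

    return boto3.client("s3")


def fetch_transformed_object(s3_client, key, access_point_arn=ACCESS_POINT_ARN):
    """Download an object through the access point and return its body as text."""
    # The access point ARN is used in place of a bucket name.
    response = s3_client.get_object(Bucket=access_point_arn, Key=key)
    return response["Body"].read().decode("utf-8")
```

With real credentials, `fetch_transformed_object(make_s3_client(), "2021-03-18.csv")` would return the transformed body rather than the stored one.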
After seeing this, my next thought was “does creating pre-signed S3 URLs work for this?” After creating my own transform by hand, I found that creating a pre-signed URL does indeed work. So by modifying the above code, we can get a pre-signed URL that does the exact same thing.
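A rough sketch of the pre-signed variant, using boto3's `generate_presigned_url` with the access point ARN as the bucket. The ARN, the key, and the five-minute expiry are assumptions chosen for illustration.

```python
# Hypothetical ARN for illustration; substitute your own Object Lambda Access Point.
ACCESS_POINT_ARN = (
    "arn:aws:s3-object-lambda:us-east-1:123456789012:accesspoint/example-olap"
)


def presign_transformed_get(
    s3_client, key, access_point_arn=ACCESS_POINT_ARN, expires_in=300
):
    """Return a time-limited URL that fetches `key` through the access point."""
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": access_point_arn, "Key": key},
        ExpiresIn=expires_in,  # lifetime of the URL in seconds
    )


# Usage with a real client (requires boto3 1.17+ and AWS credentials):
#   import boto3
#   url = presign_transformed_get(boto3.client("s3"), "2021-03-18.csv")
```

Signing happens locally in the client, so no request reaches S3 until someone actually fetches the URL.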
This is nice, because the consumer of the data does not necessarily have to issue their own specific API call to S3 to get the data, provided the owner of the data creates easy access to a pre-signed URL.
One easy way to provide pre-signed S3 URLs would be a REST API endpoint that allows someone to request access to a transformed version of an S3 object. For this scenario, because our objects (filenames) are structured by date, we could have a simple endpoint that allows clients to request a specific object by filename:
The API handler code could generate a pre-signed S3 URL and return that to the client. Then performing a GET request on the returned URL would execute a call to the S3 Object Lambda Access Point, transforming and returning the S3 object at the same time.
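To make that flow concrete, here is a hedged sketch of such a handler for an API Gateway Lambda proxy event. The `filename` path parameter, the ARN, the response shape, and all function names are assumptions for illustration, not the code from the repo.

```python
import json
import re
from datetime import datetime

# Hypothetical ARN for illustration; substitute your own Object Lambda Access Point.
ACCESS_POINT_ARN = (
    "arn:aws:s3-object-lambda:us-east-1:123456789012:accesspoint/example-olap"
)
_KEY_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}\.csv")


def is_valid_key(filename):
    """Accept only names shaped like YYYY-MM-DD.csv that are real calendar dates."""
    if not _KEY_PATTERN.fullmatch(filename):
        return False
    try:
        datetime.strptime(filename[:10], "%Y-%m-%d")
    except ValueError:
        return False
    return True


def build_response(status, payload):
    """Shape a response the way an API Gateway proxy integration expects."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def handle_request(event, s3_client):
    """Validate the requested filename and return a pre-signed URL for it."""
    # "filename" is an assumed path parameter name.
    filename = (event.get("pathParameters") or {}).get("filename", "")
    if not is_valid_key(filename):
        return build_response(400, {"error": "expected a filename like YYYY-MM-DD.csv"})
    url = s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": ACCESS_POINT_ARN, "Key": filename},
        ExpiresIn=300,
    )
    return build_response(200, {"url": url})


def api_handler(event, context):
    """Lambda entry point; creates the real client and delegates."""
    import boto3  # imported lazily so handle_request can be tested with a stub

    return handle_request(event, boto3.client("s3"))
```

Validating the filename before signing keeps clients from requesting arbitrary keys through the endpoint.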
To test this solution out, we need to do a few things:
- Build and test a transform pipeline for the source data
- Create a Lambda function that is the actual transform used by S3 Object Lambda Access Points
- Create an API endpoint for generating the pre-signed URL to fetch and transform an S3 Object
To transform the data, we’ll use Gretel’s open source transforms. For this particular dataset we’ll run the following operations across specific field names.
- Consistently replace names with mock ones
- Shift identifiable dates by +/- 10 days
- Partially redact employee IDs
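To illustrate what these three operations mean on a single record, here is a plain-Python stand-in. It is not the Gretel Transforms API; the field names, the mock name list, and the secret are all hypothetical.

```python
import hashlib
import hmac
from datetime import date, timedelta

SECRET = b"demo-secret"  # hypothetical key; keeps replacements consistent across calls
MOCK_NAMES = ["Avery Stone", "Jordan Lake", "Morgan Reed", "Riley Frost"]


def _digest_int(value):
    """Map a string to a stable integer using a keyed hash."""
    digest = hmac.new(SECRET, value.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")


def fake_name(name):
    """The same input name always maps to the same mock name."""
    return MOCK_NAMES[_digest_int(name) % len(MOCK_NAMES)]


def shift_date(iso_date, key, max_days=10):
    """Shift a date by a stable offset in [-max_days, +max_days] derived from key."""
    offset = _digest_int(key) % (2 * max_days + 1) - max_days
    return (date.fromisoformat(iso_date) + timedelta(days=offset)).isoformat()


def redact_id(employee_id, keep=2, mask="X"):
    """Mask everything except the last `keep` characters."""
    visible = employee_id[-keep:] if keep else ""
    return mask * (len(employee_id) - len(visible)) + visible


def transform_record(record):
    """Apply all three operations to one record with assumed field names."""
    out = dict(record)
    out["name"] = fake_name(record["name"])
    out["birth_date"] = shift_date(record["birth_date"], key=record["employee_id"])
    out["employee_id"] = redact_id(record["employee_id"])
    return out
```

Deriving the replacement and the offset from a keyed hash is one simple way to keep results consistent between requests without storing a lookup table.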
This pipeline can be constructed using Gretel Transforms:
Once this is constructed, the pipeline object has a transform_df(df: pd.DataFrame) method that can be used to transform an entire DataFrame.
In our application code, we read in the original S3 Object body as a string, and we need to return it as a string. So we have a utility function that converts the body to a Pandas DataFrame, runs our transform, and returns a string:
Next let’s start to build out our stack. Feel free to follow along if desired. We can start by cloning the repo:
The core logic is in our handler.py module. We can use this module to build a Docker container that is built on the AWS Lambda Python image. This container can be run locally to test any function we have defined in handler.py, and it will also be deployed to ECR so it can be used as the runtime image for our actual Lambda functions. The Dockerfile is pretty straightforward here:
NOTE: The base Lambda images, at the time of this blog, have version 1.16.x of boto3 installed. The new API calls required for S3 Object Lambda require boto3 1.17+, so we directly install a new copy of boto3 into the Lambda task directory. In our handler.py code, we’ll patch the Python path by having it look for packages in the root first, so we can make use of these new API calls.
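The path patch can be sketched roughly as follows. It relies on the `LAMBDA_TASK_ROOT` environment variable that the Lambda runtime sets (`/var/task` in the base images); the helper name is ours, not the one used in handler.py.

```python
import os
import sys


def prefer_task_root(task_root=None):
    """Put the Lambda task directory first on sys.path and return it."""
    root = task_root or os.environ.get("LAMBDA_TASK_ROOT", "/var/task")
    # Drop any existing entry so the directory is not searched twice.
    sys.path[:] = [p for p in sys.path if p != root]
    sys.path.insert(0, root)
    return root


# In a handler module this would run before boto3 is imported, for example:
#   prefer_task_root()
#   import boto3  # now resolves to the copy installed in the task directory
```

The call has to happen before the first `import boto3`, because Python caches modules once they have been imported.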
We’ll set the s3_proxy_handler as the default entry point for the Lambda container, but we can override this for other Lambda functions and use cases as needed.
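For orientation, an S3 Object Lambda handler generally follows the shape below: read the `getObjectContext` from the event, fetch the original object from the supplied pre-signed URL, transform it, and send the result back with `write_get_object_response`. This is a hedged sketch with our own function names, not the repo's s3_proxy_handler; the transform is passed in as a plain string-to-string function.

```python
import urllib.request


def fetch_original(url):
    """Download the original object body from the pre-signed input URL."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def handle_object_request(event, s3_client, transform, fetch=fetch_original):
    """Fetch the original object, transform it, and return it to the caller."""
    context = event["getObjectContext"]
    original = fetch(context["inputS3Url"])
    transformed = transform(original)
    # The route and token tie this response to the caller's pending GetObject.
    s3_client.write_get_object_response(
        Body=transformed.encode("utf-8"),
        RequestRoute=context["outputRoute"],
        RequestToken=context["outputToken"],
    )
    return {"status_code": 200}


def s3_proxy_handler_sketch(event, context):
    """Hypothetical entry point; the identity transform stands in for the real one."""
    import boto3  # imported lazily so the function above can be tested with stubs

    return handle_object_request(event, boto3.client("s3"), transform=lambda body: body)
```

Keeping the client and the fetch function injectable makes it easy to exercise the logic locally without touching AWS.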
We can test our transform logic locally by using the built-in Lambda emulator the base container provides. To do this, you can build the container locally and run a helper script that is in the repo:
NOTE: The overridden entry point here, handler.record_handler, lets us use a different handler just for running the _transform_data function from above.
If things work, you should see the first records from the transformed dataset.
Next are the specific handlers that will run on AWS Lambda. These are in the code as the following functions:
To manage our AWS infrastructure, we’ll use Serverless.
Huge shout out to the work these folks do. It has truly made the Developer Experience (DX) of building complex AWS solutions way more enjoyable.
The entire stack is managed through the serverless.yml file. With Node.js installed on your system, you can install Serverless via npm install serverless.
This file has a few key sections to call out:
NOTE: If you want to deploy this yourself, you will have to modify the custom.baseName property in the serverless.yml file. All of the AWS resource names stem from that value, and S3 bucket names must be globally unique, so the ones we used are already taken.
To deploy we just run a simple:
If things go smoothly, you should see some output that ends with:
Since we’ve already deployed this in our infrastructure, you can always just test our live endpoint:
The new S3 Object Lambda feature opens up tons of possibilities! We really enjoyed putting together this demo stack because it shows how the Developer Experience for cloud engineering is improving, and of course we are even more thrilled that our own Gretel Transforms can drop into a workflow like this!