Anonymize Data with S3 Object Lambda

Anonymize data at access time with Gretel and Amazon S3 Object Lambda.

Just recently, AWS announced support for S3 Object Lambda, and the possibilities and use cases are endless.  The first thing we thought of was using Gretel Transforms to anonymize S3 object data in real time as it's requested, without having to create and store a duplicate version of the data.  We've also been itching to test out container images for AWS Lambda, so putting together an example stack that does all of these things felt like a great project to work on.

In this blog we use Gretel Transforms, AWS Lambda container images, S3 Object Lambda, API Gateway, and the Serverless Framework to launch a real-time S3 object transformation service.  By the end of this blog, you will be able to launch an end-to-end service that does real-time anonymization of data in S3 behind a REST API.

This GitHub Repo contains all the necessary ingredients to cook up some transforms with S3 Object Lambda along with the Serverless configuration to launch our entire AWS CloudFormation stack.

If you want the TL;DR, you can run this code:

import requests

OBJECT = "2021-03-22.csv"

# Ask the sharing endpoint for a pre-signed URL to the transformed object
url_data = requests.get(
    "https://yh3glj0b89.execute-api.us-east-2.amazonaws.com/dev/share",
    params={"key": OBJECT}
)

# Fetching that URL invokes the S3 Object Lambda transform and returns the result
retrieval_url = url_data.json()["url"]
transformed_data = requests.get(retrieval_url)
print(transformed_data.text)

Let's dive in...

Scenario

For this demo, let's assume you have an S3 bucket with structured data that you would ideally like to make available to other teams, organizations, or customers.  However, in order to share this information, you need to redact or transform some of the content to preserve privacy.

Secondly, your downstream consumers may not need to access every object, but instead selectively choose an object to retrieve. Let's assume objects are named by date, "YYYY-MM-DD.csv", and a user just wants to occasionally fetch data for a specific date.

One of the leading strategies to support this is "transform-on-write": when an S3 object is initially created, some workflow executes that creates a transformed version of the object and stores it separately in S3.

At this point, you can generate a pre-signed S3 URL to give access to the transformed object.

One drawback to this strategy is that for every object you store, a transformed version must also be stored. If access to the transformed objects is sparse, this can really inflate your costs for both storage and the compute used for transforms.

Solution Design

With the new S3 Object Lambda feature, users can request access to S3 objects that are transformed in real-time as they are returned to the client. Super cool!

Stepping through the AWS blog post, we see the API call to download the object from the Object Lambda Access Point:

print('Object processed by S3 Object Lambda:')
transformed = s3.get_object(
  Bucket='arn:aws:s3-object-lambda:us-east-1:123412341234:accesspoint/myolap',
  Key='s3.txt')
print(transformed['Body'].read().decode('utf-8'))

After seeing this, my next thought was: does creating pre-signed S3 URLs work here too? After building my own transform by hand, it turns out that it does.  So, modifying the above code, we can generate a pre-signed URL that does the exact same thing.

url = s3.generate_presigned_url(
  "get_object",
  Params={
    "Bucket": "arn:aws:s3-object-lambda:us-east-1:123412341234:accesspoint/myolap",
    "Key": "s3.txt"
  }
)
print(requests.get(url).text)

This is nice, because the consumer of the data does not necessarily have to issue their own specific API call to S3 to get the data, provided the owner of the data creates easy access to a pre-signed URL.

One easy way to provide pre-signed S3 URLs is a REST API endpoint that allows someone to request access to a transformed version of an S3 object. For this scenario, because our objects (filenames) are structured by date, we could have a simple endpoint that allows clients to request a specific object by filename:

GET /endpoint?key=YYYY-MM-DD.csv

The API handler code could generate a pre-signed S3 URL and return that to the client. Then performing a GET request on the returned URL would execute a call to the S3 Object Lambda Access Point, transforming and returning the S3 object at the same time.
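As a rough sketch of what that handler could look like (the access point ARN, environment variable name, and expiration below are placeholder assumptions, not the values our stack actually uses):

import json
import os

import boto3

s3 = boto3.client("s3")

# Placeholder: in a real stack this ARN would come from the deployed
# S3 Object Lambda Access Point.
OLAP_ARN = os.environ.get(
    "OLAP_ARN",
    "arn:aws:s3-object-lambda:us-east-2:123412341234:accesspoint/myolap",
)

def handler(event, context):
    """Return a pre-signed URL for the requested (and transformed) object."""
    key = event["queryStringParameters"]["key"]
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": OLAP_ARN, "Key": key},
        ExpiresIn=3600,  # assumed one-hour expiration
    )
    return {"statusCode": 200, "body": json.dumps({"url": url})}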

Implementation

To test this solution out, we need to do a few things:

  1. Build and test a transform pipeline for the source data
  2. Create a Lambda function that is the actual transform used by S3 Object Lambda Access Points
  3. Create an API endpoint for generating the pre-signed URL to fetch and transform an S3 Object

To transform the data, we’ll use Gretel’s open source transforms. For this particular dataset we’ll run the following operations across specific field names.

  • Consistently replace names with mock ones
  • Shift identifiable dates by +/- 10 days
  • Partially redact employee IDs

This pipeline can be constructed using Gretel Transforms:

from gretel_client.transformers import (
    DateShiftConfig,
    RedactWithCharConfig,
    FakeConstantConfig,
    DataTransformPipeline,
    DataPath,
    FieldRef,
    StringMask,
)

date_shifter = DateShiftConfig(
    secret="2B7E151628AED2A6ABF7158809CF4F3CEF4359D8D580AA4F7F036D6F04FC6A95",
    lower_range_days=-10,
    upper_range_days=10,
    date_format="%m/%d/%Y",
    tweak=FieldRef("Employee_Name"),
)

name_replacer = FakeConstantConfig(seed=8675309, fake_method="name")

# Mask up until the last 4 chars
id_redactor = RedactWithCharConfig(
    mask=[StringMask(end_pos=-4)]
)


paths = [
    DataPath(input="*Name", xforms=[name_replacer]),
    DataPath(input="DOB", xforms=[date_shifter]),
    DataPath(input="DateofHire", xforms=[date_shifter]),
    DataPath(input="EmpID", xforms=[id_redactor]),
    # Uncomment to allow all other fields to pass through
    # DataPath(input="*")
]

pipeline = DataTransformPipeline(data_paths=paths)

Once this is constructed, the pipeline object has a transform_df(df: pd.DataFrame) method that can be used to transform an entire DataFrame.

In our application code, we read in the original S3 Object body as a string, and we need to return it as a string. So we have a utility function that converts the body to a Pandas DataFrame, runs our transform, and returns back a string:

import io

import pandas as pd


def _transform_data(data: str) -> str:
    """Take the full CSV data, load it into a DataFrame,
    transform it, and return the new version as a CSV string.
    """
    df = pd.read_csv(io.StringIO(data))
    xf_df = pipeline.transform_df(df)
    out_str = io.StringIO()
    xf_df.to_csv(out_str, index=False)
    return out_str.getvalue()
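
As a quick sanity check, you can feed this helper a couple of hand-written rows; the column names below match the DataPaths above, and the values are made up:

sample_csv = (
    "EmpID,Employee_Name,DOB,DateofHire\n"
    "10001,Jane Doe,04/12/1985,07/01/2015\n"
    "10002,John Smith,11/30/1990,03/15/2018\n"
)

# Expect mocked names, consistently shifted dates, and employee IDs
# masked everywhere except the last four characters.
print(_transform_data(sample_csv))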

Next, let’s start to build out our stack. Feel free to follow along; we can start by cloning the repo:

$ git clone https://github.com/gretelai/s3-object-lambda-gretel-transforms.git

The core logic is in our handler.py module. We use this module to build a Docker container based on the AWS Lambda Python base image. This container can be run locally to test any function we have defined in handler.py, and it will also be deployed to ECR so it can be used as the runtime image for our actual Lambda functions.  The Dockerfile here is pretty straightforward:

FROM public.ecr.aws/lambda/python:3.8

COPY . ./
RUN pip install -r requirements.txt

# Hack to get latest boto3 for newest APIs
RUN pip install boto3 -t .

CMD ["handler.s3_proxy_handler"]

NOTE: The base Lambda images, at the time of this blog, have version 1.16.x of boto3 installed. The new API calls required for S3 Object Lambda require boto3 1.17+, so we directly install a new copy of boto3 into the Lambda task directory. In our handler.py code, we’ll patch the Python path by having it look for packages in the root first, so we can make use of these new API calls.
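
One way that patch might look (a sketch; handler.py may do it slightly differently) is to push the Lambda task root to the front of sys.path before importing boto3:

import os
import sys

# Prefer packages installed into the task root (including the newer boto3
# we pip-installed in the Dockerfile) over the runtime's bundled copies.
sys.path.insert(0, os.environ.get("LAMBDA_TASK_ROOT", "."))

import boto3  # now resolves to the 1.17+ copy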

By default, we’ll set the s3_proxy_handler as the default entry point for the Lambda container. But we can override this for other Lambda functions and use cases as needed.

We can test our transform logic locally by using the built-in Lambda emulator the base container provides. To do this, you can build the container locally and run a helper script that is in the repo:

$ docker build -t gretel-transform-test .
$ docker run -it -p 9000:8080 gretel-transform-test "handler.record_handler"
$ python test_local.py

NOTE: The overridden entry point here, handler.record_handler, lets us use a different handler just for running the _transform_data function from above.
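
For reference, the emulator exposes a local HTTP endpoint, so a stripped-down version of test_local.py could look roughly like this (the event payload shape is an assumption; check the script in the repo for the real one):

import requests

# Invocation endpoint exposed by the Lambda Runtime Interface Emulator
# when the container is run with `-p 9000:8080`.
LOCAL_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"

with open("sample.csv") as fh:  # hypothetical local CSV sample
    event = {"data": fh.read()}  # assumed event shape for record_handler

resp = requests.post(LOCAL_URL, json=event)
print(resp.json())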

If things work, you should see the first records from the transformed dataset.

Next are the specific handlers that will run on AWS Lambda. These are in the code as the following functions:

  • s3_proxy_handler: This function is the actual Lambda function configured to run for the S3 Object Lambda Access Point (a minimal sketch of this flow follows the list).
  • get_sharing_url: This function is the handler for the API Gateway endpoint, which generates a pre-signed S3 URL for the user and returns it synchronously. An API Gateway service is automatically launched because of the events.http declaration we provide. This API endpoint / function also parses out a key URL query parameter as the S3 object name.
    • NOTE: The signed URL will be created even if the S3 object does not exist. A user-facing 404 will only be returned when the signed URL is actually used.
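
To make that flow concrete, here is a minimal sketch of what an S3 Object Lambda handler like s3_proxy_handler generally does, using the _transform_data helper from earlier (the real implementation lives in handler.py):

import boto3
import requests

s3 = boto3.client("s3")

def s3_proxy_handler(event, context):
    """Fetch the original object, transform it, and stream it back via S3."""
    ctx = event["getObjectContext"]

    # S3 hands us a pre-signed URL pointing at the original, untransformed object.
    original = requests.get(ctx["inputS3Url"]).text

    # Run the Gretel transform pipeline over the CSV body.
    transformed = _transform_data(original)

    # Return the transformed bytes to S3, which relays them to the requester.
    s3.write_get_object_response(
        Body=transformed,
        RequestRoute=ctx["outputRoute"],
        RequestToken=ctx["outputToken"],
    )

    return {"status_code": 200}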

Ship it!

To manage our AWS infrastructure, we’ll use Serverless.

Huge shout-out to the work these folks do; it has truly made the Developer Experience (DX) of building complex AWS solutions way more enjoyable.

The entire stack is managed through the serverless.yml file. With NodeJS installed on your system, you can install Serverless via npm install serverless.

This file has a few key sections to call out:

  • provider: This section contains a specific ecr block that will actually look for a Dockerfile, build the container and upload to ECR for you. With this image in ECR, you can reference it as the source for your Lambda functions later on.
  • iam: Because we are generating a pre-signed URL for the S3 Object Lambda Access Point, we have to ensure the Lambda function that is creating this pre-signed URL has the permissions to run the actual transform lambda and have access to S3.
  • functions: These are the actual Lambda functions that will get built. Note that image.name points to the value we specified in the ecr section. Serverless resolves this all for us automatically when launching the CloudFormation stack. Also, we override the entry point for the gretelshare handler to the get_sharing_url function. When providing a custom command here, the entryPoint must be provided. When using the AWS base images (public.ecr.aws/lambda/python:3.8), this entryPoint will always be /lambda-entrypoint.sh.
  • resources: This section contains the actual CloudFormation statements needed to build out the following resources for our system:
    • An S3 bucket.
    • An S3 Access Point for our bucket.
    • An S3 Object Lambda Access Point which is attached to our S3 Access Point and references our specific greteltransform Lambda function. This lambda function will be invoked every time an S3 Object is requested from the Object Lambda Access Point.

NOTE: If you want to deploy this yourself, you will have to modify the custom.baseName property in the serverless.yml file. All of the AWS resource names stem from that value, and S3 bucket names must be globally unique, so they cannot be shared.

To deploy, we just run:

$ sls deploy

If things go smoothly, you should see some output that ends with:

endpoints:
  GET - https://yh3glj0b89.execute-api.us-east-2.amazonaws.com/dev/share
functions:
  greteltransform: gretel-s3-proxy-demo-dev-greteltransform
  gretelshare: gretel-s3-proxy-demo-dev-gretelshare

Since we’ve already deployed this in our infrastructure, you can always just test our live endpoint:

$ python test_remote.py

Summary

The new S3 Object Lambda feature opens up a ton of possibilities! We really enjoyed putting together this demo stack because it shows just how much the Developer Experience for cloud engineering is improving, and of course we're even more thrilled that our own Gretel Transforms can drop right into a workflow like this!