Synthetic Data, Real Privacy: Automating Secure Workflows with Gretel and Amazon SageMaker

Generate private and shareable data automatically by triggering Gretel jobs in Amazon SageMaker

Summary

At Gretel, we’re excited about empowering developers with the tools they need to address data privacy and security risks that often hamper innovation. By partnering with Amazon, we’re making multimodal synthetic data generation tools and evaluation metrics broadly available, so teams everywhere can collaborate and build great products with safe, shareable data that respects personal privacy.

In this post, we demonstrate how a Gretel job can run seamlessly in the background once a user (or automated process) uploads a CSV file to a designated source S3 bucket with restricted permissions. Once set up, the entire pipeline runs autonomously without any manual intervention. The uploaded files could contain sensitive data, so sharing access to this S3 bucket would pose a privacy risk, even across internal teams. Note: For users with stringent security requirements, these processes can be run in Gretel Hybrid mode to ensure their data never leaves their AWS environment.

The S3 ObjectCreated event in the S3 source bucket will invoke an AWS Lambda function that starts an Amazon SageMaker notebook instance. The notebook instance will automatically execute the notebook code via the notebook's lifecycle configuration.

In the SageMaker notebook, we first retrieve all the CSV files that were uploaded to the restricted bucket (whether as a single or a bulk upload) and then call the Gretel API to run Gretel Transform and Gretel Synthetics on these files. With Gretel, we first de-identify the files by removing the PII, then train a synthetic model on the de-identified files, create a Synthetic Data Quality Score (SQS) report, and generate a synthetic version of the dataset. The synthetic data and the SQS report will then be uploaded to a destination S3 bucket that can have more permissive access rights.

Internal teams can retrieve the synthetic data stored in the destination bucket without ever needing to see the production data. They don't even need to run the synthetic model, as this process runs automatically in the background. The synthetic version of the sensitive or production data can then be used by data scientists for data analysis, ML engineers for ML model training, or software engineers for test data management use cases.

This synthetic data generation process leverages different AWS services to make the workflow possible, including:

  • Amazon S3 to access data
  • AWS Identity and Access Management (IAM) to restrict access rights
  • AWS Lambda to trigger the pipeline on an S3 object create event
  • Amazon SageMaker to run the Gretel SDK
  • AWS Secrets Manager to store the Gretel API Key
  • Amazon CloudWatch to monitor the overall process

The architectural diagram is depicted below, and you can watch an end-to-end demo of this process here.

Figure 1. Architecture for Integrating Gretel with Amazon SageMaker.

Step 1 - Create S3 source and destination buckets

We start by creating two S3 buckets:

  • gretel-source-data-bucket: This is the source bucket where sensitive data or production data will be stored. The data can be uploaded by users with restricted access or an automated process that dumps data into that bucket. Note: S3 bucket names must be unique.
  • gretel-destination-data-bucket: This is the destination bucket and will be used to contain the synthetic data and other artifacts produced by Gretel, such as the Gretel Synthetic Data Quality Report, report summary, and logs. Since the data stored in this bucket is PII-removed and synthetic, we don’t have to worry about privacy and can more broadly share access to the data.
Figure 2: S3 source and destination buckets.
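
If you prefer scripting this step, the buckets can also be created with boto3. The following is a minimal sketch using the example bucket names from above; since bucket names must be globally unique, you would substitute your own. The is_valid_bucket_name helper is our own illustration and checks only a few of the S3 naming rules (3-63 characters; lowercase letters, digits, and hyphens).

```python
import re


def is_valid_bucket_name(name: str) -> bool:
    """Check a few of the S3 bucket naming rules: 3-63 characters,
    lowercase letters, digits, and hyphens, starting and ending with
    a letter or digit. (S3 enforces additional rules not checked here.)"""
    return bool(re.fullmatch(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]", name))


def create_buckets(region: str = "us-east-1") -> None:
    """Create the source and destination buckets (requires AWS credentials).

    boto3 is imported lazily so the name-validation helper above can be
    used without AWS dependencies. Note that regions other than us-east-1
    require a CreateBucketConfiguration argument.
    """
    import boto3

    s3 = boto3.client("s3", region_name=region)
    for name in ("gretel-source-data-bucket", "gretel-destination-data-bucket"):
        assert is_valid_bucket_name(name), f"invalid bucket name: {name}"
        s3.create_bucket(Bucket=name)
```

Either way, the result is the same pair of buckets shown in the console screenshot.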

Step 2 - Create Secret for the Gretel API key

We don’t want to expose the Gretel API key or hard-code it in the notebook. So we’ll use AWS Secrets Manager to create a secret Key-Value pair to store the API Key.

Let’s set the key name to “gretelApiKey” and grab the Gretel API key from the Gretel Console at https://console.gretel.ai/.

Figure 3: Gretel API key
Figure 4: Create a secret in AWS Secrets Manager for the Gretel API key.
Figure 5: Create Key/Value pair to store the Gretel API key.

Finally, we can use the following code snippet in the notebook to retrieve the Gretel API Key to configure the Gretel session.

import json

import boto3
from botocore.exceptions import ClientError
from gretel_client import configure_session


def get_secret():

    secret_name = "prod/Gretel/ApiKey"
    region_name = "us-east-1"

    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
        raise e

    # The secret is stored as a JSON string of key/value pairs
    secret = json.loads(get_secret_value_response['SecretString'])

    return secret["gretelApiKey"]


# Configure a Gretel session
GRETEL_API_KEY = get_secret()
configure_session(api_key=GRETEL_API_KEY, cache="yes", validate=True)

Step 3 - Create SageMaker notebook

Now let’s create the SageMaker notebook instance. In this example, we name the instance gretel-schedule-notebook. Since the notebook doesn’t need to do heavy computation, we can select a cost-efficient instance type. Note that all model training happens in the Gretel Cloud (in this example) or in an Amazon EKS cluster (if you opt for a Gretel Hybrid deployment).

Figure 6: Create a SageMaker notebook instance where the Gretel SDK will be called.

Step 4 - Add notebook to the SageMaker instance

Once the notebook instance is created, we start the instance and open JupyterLab to add the notebook.

The notebook will run the following steps:

  1. Install the gretel-client.
  2. Add the code snippet to retrieve the Gretel API key from Secrets Manager.
  3. Fetch all the uploaded CSV files from the restricted access S3 bucket.
  4. Use Gretel Transform to remove all the PII in the data.
  5. Train a Gretel Synthetic model on the de-identified data files.
  6. Create a Synthetic Data Quality Report.
  7. Generate synthetic data using the synthetic model.
  8. Upload the data quality report and synthetic data to the destination S3 bucket.

The notebook for this blog post can be found here. In our workflow, we renamed the notebook gretel-model-run.ipynb.
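
To give a feel for step 3, here is a hedged sketch of how the notebook might collect the uploaded CSV keys from the source bucket before handing them to Gretel Transform and Synthetics (those SDK calls follow the patterns in the linked notebook). The filter_csv_keys helper is our own naming for illustration; the boto3 paginator call is standard.

```python
def filter_csv_keys(keys):
    """Keep only CSV object keys (case-insensitive extension match)."""
    return [k for k in keys if k.lower().endswith(".csv")]


def list_uploaded_csvs(bucket="gretel-source-data-bucket"):
    """List every CSV in the source bucket, covering single or bulk uploads.

    Requires AWS credentials; boto3 is imported lazily so filter_csv_keys
    can be used on its own.
    """
    import boto3

    s3 = boto3.client("s3")
    keys = []
    # list_objects_v2 returns at most 1,000 keys per call, so paginate
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return filter_csv_keys(keys)
```

The returned keys can then be downloaded and passed to the Gretel Transform and Synthetics steps shown in the notebook.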

Step 5 - Create notebook Lifecycle configuration

Next, within SageMaker, we’ll create a Lifecycle configuration that allows us to run and automatically stop the notebook. The notebook instance will be stopped after the notebook has been idle for a specified number of seconds.

Figure 7: Configure a lifecycle for the SageMaker notebook instance.

The script to use is the following:

set -e
ENVIRONMENT=python3
NOTEBOOK_FILE="/home/ec2-user/SageMaker/gretel-model-run.ipynb"

echo "Activating conda env"
source /home/ec2-user/anaconda3/bin/activate "$ENVIRONMENT"
echo "Starting notebook"
nohup jupyter nbconvert --to notebook --inplace --ExecutePreprocessor.timeout=600 --ExecutePreprocessor.kernel_name=python3 --execute "$NOTEBOOK_FILE" &
echo "Deactivating conda env"
source /home/ec2-user/anaconda3/bin/deactivate

# PARAMETERS
IDLE_TIME=600

echo "Fetching the autostop script"
wget https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/master/scripts/auto-stop-idle/autostop.py

echo "Detecting Python install with boto3 install"

# Find which install has boto3 and use that to run the cron command. So will use default when available
# Redirect stderr as it is unneeded
CONDA_PYTHON_DIR=$(source /home/ec2-user/anaconda3/bin/activate /home/ec2-user/anaconda3/envs/JupyterSystemEnv && which python)
if $CONDA_PYTHON_DIR -c "import boto3" 2>/dev/null; then
    PYTHON_DIR=$CONDA_PYTHON_DIR
elif /usr/bin/python -c "import boto3" 2>/dev/null; then
    PYTHON_DIR='/usr/bin/python'
else
    # If no boto3 just quit because the script won't work
    echo "No boto3 found in Python or Python3. Exiting..."
    exit 1
fi
echo "Found boto3 at $PYTHON_DIR"

echo "Starting the SageMaker autostop script in cron"

(crontab -l 2>/dev/null; echo "*/5 * * * * $PYTHON_DIR $PWD/autostop.py --time $IDLE_TIME --ignore-connections >> /var/log/jupyter.log") | crontab -

Once done, we need to attach the lifecycle configuration to the notebook instance. Note that if we want to make any modifications to the notebook, we’ll have to remove the lifecycle from the instance to prevent auto-stopping.

Figure 8: Add the lifecycle to the SageMaker notebook instance.

Step 6 - Create the AWS Lambda function

Now we create a Lambda function with the function name launch_gretel_nb.

Figure 9: Create AWS Lambda function.

We then add an S3-based event trigger.

Figure 10: Configure the AWS Lambda function to trigger based on S3.

The trigger configuration is set up so the S3 bucket gretel-source-data-bucket is the event source with “object create events” as the event types. For this example, we’ll limit file types to CSV.

Note: Gretel supports other flat data formats, such as JSON and Parquet, and we can always add separate triggers for those file formats. Alternatively, you could also consider using Gretel Relational in your automated pipeline with events triggered from Amazon DynamoDB or Amazon DocumentDB.

Figure 11: Further specify the S3 trigger by a create event for CSV files in the specified bucket.

Once the Lambda function is configured, we add a few lines of Python code to instruct the function to start the SageMaker notebook instance that we named gretel-schedule-notebook.

import boto3

def lambda_handler(event, context):
    # Start the notebook instance; its lifecycle configuration runs
    # the notebook and stops the instance once it goes idle
    client = boto3.client('sagemaker')
    client.start_notebook_instance(NotebookInstanceName='gretel-schedule-notebook')
    return {"statusCode": 200}
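
The handler above doesn't need to inspect the event payload, but knowing its shape is useful if you want to log or filter the uploads that trigger a run. S3 delivers notifications in the documented Records structure; the sketch below is a pure helper (our own naming) that extracts the bucket and key from each record, shown with a minimal test event abridged from the AWS-documented format.

```python
def uploaded_objects(event):
    """Extract (bucket, key) pairs from an S3 notification event.

    Note: object keys in real events arrive URL-encoded; production code
    may want urllib.parse.unquote_plus on the key.
    """
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in event.get("Records", [])
    ]


# A minimal S3 "ObjectCreated:Put" test event, abridged from the
# structure AWS documents for S3 event notifications
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "gretel-source-data-bucket"},
                "object": {"key": "uploads/customers.csv"},
            },
        }
    ]
}
```

You can paste a payload like sample_event into the Lambda console's test feature to exercise the function without uploading a file.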

Step 7 - Attach the right IAM policies to the SageMaker role

We want to make sure that the SageMaker role has access rights to our S3 buckets and the Secret that holds the Gretel API key.

Therefore, we'll create the following policies in IAM and add them to the SageMaker role: 

  • S3GretelPolicy: allows read/write access to the S3 source and destination buckets.
  • SecretsManagerGretelAPIKey: grants access to the secret holding the Gretel API key.

Go to IAM, select Policies, and click “Create policy”. For each policy, add the corresponding JSON document below and give it the name listed above.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::gretel-source-data-bucket",
                "arn:aws:s3:::gretel-destination-data-bucket"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::gretel-source-data-bucket/*",
                "arn:aws:s3:::gretel-destination-data-bucket/*"
            ]
        }
    ]
}

SecretsManagerGretelAPIKey

Here, we’ll need to get the ARN from SecretsManager.

Figure 12: Retrieve the ARN of the secret storing the Gretel API key.

Next, we’ll add it to the policy.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetResourcePolicy",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
                "secretsmanager:ListSecretVersionIds"
            ],
            "Resource": "arn:aws:secretsmanager:us-east-1:750978791311:secret:prod/Gretel/ApiKey-GdDZfv"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "secretsmanager:GetRandomPassword",
            "Resource": "*"
        }
    ]
}

Once these policies are created, we’ll add them to the SageMaker role:

Figure 13: Add the created IAM policies with restricted access to S3 and Secrets Manager to the SageMaker role.

Finally, the role will look as follows:

Figure 14: List of required IAM permission policies for the SageMaker role.

Now we’re ready to test out our workflow!

Step 8 - Monitor the workflow in AWS CloudWatch

In AWS CloudWatch, we can retrieve the logs to see each time the Lambda function was triggered. We can also check the logs produced by the SageMaker notebook, which is helpful for debugging.

Figure 15: Watch the trigger events in CloudWatch.
Figure 16: Watch the SageMaker logs in CloudWatch.

Conclusion

In this post, we walked you through the process of setting up an automated workflow that allows you to create synthetic data using Gretel with AWS services such as SageMaker, Lambda, and Secrets Manager. Once configured, the processes run fully autonomously and can be monitored via Amazon CloudWatch. If you have more stringent security requirements, these processes can be used in a Gretel Hybrid environment to ensure data never leaves your control. Either setup enables the generation of an unlimited amount of PII-free, high-quality synthetic data that can be safely shared across your organization and with external stakeholders.

To see an end-to-end demo of the process, check out our webinar recording. The demo starts here, or you can watch the full video.

You can find Gretel on the AWS Marketplace. If you have any questions about Gretel’s platform or would like to learn more about how synthetic data can help your business, send us a note.

About the authors

Maarten Van Segbroeck

Maarten Van Segbroeck is a Principal Applied Scientist at Gretel, where he helps customers adopt synthetic data solutions for their use cases. Prior to Gretel, he spent 5 years at Amazon, dividing his time between AWS and Amazon Alexa. Maarten holds a PhD in Electrical Engineering from the Catholic University of Leuven and worked as a postdoctoral researcher at the University of California, Los Angeles.

Rumi Olsen

Rumi is a global thought leader in the fields of AI and machine learning, with 18 years of experience in the technology industry. With expertise in AWS cloud, solutions architecture, and hands-on software development, Rumi has consistently achieved remarkable outcomes. Beyond her technical prowess, she is an active blogger and video content creator, crafting original ML content for re:Invent and hosting engaging in-person sessions. As she leads a team of AI/ML Partner Solutions Architects at AWS, Rumi collaborates closely with top machine learning ISV partners, leveraging AWS ML services to elevate their products and spearheading strategic engagements.

Qiong (Jo) Zhang

Qiong (Jo) Zhang is a Senior Partner Solutions Architect at Amazon Web Services, specializing in AI/ML. Qiong worked at Fujitsu as a Research Scientist from 2008 to 2021. She received her Ph.D. degree in Computer Science from The University of Texas at Dallas. She holds 30+ patents and has co-authored 100+ journal/conference papers. She is also the recipient of the Best Paper Award at IEEE NetSoft 2016, IEEE ICC 2011, ONDM 2010, and IEEE GLOBECOM 2005.