Synthetic Data, Real Privacy: Automating Secure Workflows with Gretel and Amazon SageMaker
At Gretel, we’re excited about empowering developers with the tools they need to address data privacy and security risks that often hamper innovation. By partnering with Amazon, we’re making multimodal synthetic data generation tools and evaluation metrics broadly available, so teams everywhere can collaborate and build great products with safe, shareable data that respects personal privacy.
In this post, we demonstrate how a Gretel job can run seamlessly in the background once a user (or automated process) uploads a CSV file to a particular source S3 bucket with restricted permissions. Once set up, the entire pipeline runs autonomously without any manual intervention. The uploaded files could contain sensitive data, so sharing access to this S3 bucket would pose a privacy risk, even across internal teams. Note: For users with stringent security requirements, these processes can be run in Gretel Hybrid mode to ensure their data never leaves their AWS environment.
The S3ObjectCreation event in the S3 source bucket invokes an AWS Lambda function that starts an Amazon SageMaker notebook instance. The notebook instance then executes the notebook code automatically via its lifecycle configuration.
In the SageMaker notebook, we first retrieve all the uploaded CSV files that were added to the restricted bucket (whether from a single or a bulk upload) and then call the Gretel API to run Gretel Transform and Gretel Synthetics on these files. With Gretel, we first de-identify the files by removing the PII, then train a synthetic model on the de-identified files, create a Synthetic Data Quality Score (SQS) report, and generate a synthetic version of the dataset. The synthetic data and the SQS report are then uploaded to a destination S3 bucket that can have more permissive access rights.
Internal teams can retrieve the synthetic data stored in the destination bucket without the need to ever see the production data. They don’t even need to run the synthetic model as this process will automatically run in the background. The synthetic version of the sensitive or production data can then be used by data scientists for data analysis, ML engineers for ML model training, or software engineers for test data management use cases.
This synthetic data generation process leverages different AWS services to make the workflow possible, including:
- Amazon S3 to access data
- AWS Identity & Access Management to restrict access rights
- AWS Lambda to trigger based on an S3 Object Create event
- Amazon SageMaker to run the Gretel SDK
- AWS Secrets Manager to store the Gretel API Key
- Amazon CloudWatch to monitor the overall process
The architectural diagram is depicted below, and you can watch an end-to-end demo of this process here.
Step 1 - Create S3 source and destination buckets
We start by creating two S3 buckets:
- gretel-source-data-bucket: This is the source bucket where sensitive data or production data will be stored. The data can be uploaded by users with restricted access or by an automated process that dumps data into that bucket. Note: S3 bucket names must be globally unique.
- gretel-destination-data-bucket: This is the destination bucket and will be used to contain the synthetic data and other artifacts produced by Gretel, such as the Gretel Synthetic Data Quality Report, report summary, and logs. Since the data stored in this bucket is PII-removed and synthetic, we don’t have to worry about privacy and can more broadly share access to the data.
Step 2 - Create Secret for the Gretel API key
We don’t want to expose the Gretel API key or hard-code it in the notebook. So we’ll use AWS Secrets Manager to create a secret Key-Value pair to store the API Key.
Let’s set the key name to “gretelApiKey” and grab the Gretel API key from the Gretel Console (https://console.gretel.ai/).
Finally, we can use the following code snippet in the notebook to retrieve the Gretel API Key to configure the Gretel session.
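A minimal sketch of such a snippet, assuming the secret is stored in the same region as the notebook and uses “gretelApiKey” as both the secret name and the key (adjust `region_name` to your deployment):

```python
import json

def parse_api_key(secret_string, key_name="gretelApiKey"):
    """Extract the API key from the secret's JSON payload."""
    return json.loads(secret_string)[key_name]

def get_gretel_api_key(secret_name="gretelApiKey", region_name="us-east-1"):
    """Fetch the secret from AWS Secrets Manager and return the API key."""
    import boto3  # preinstalled on SageMaker notebook instances
    client = boto3.client("secretsmanager", region_name=region_name)
    response = client.get_secret_value(SecretId=secret_name)
    return parse_api_key(response["SecretString"])
```

The Gretel session can then be configured with `configure_session(api_key=get_gretel_api_key(), validate=True)` from the gretel-client package, so the key never appears in the notebook itself.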
Step 3 - Create SageMaker notebook
Now let’s create the SageMaker notebook instance. In this example, we name the instance gretel-schedule-notebook. Since the notebook doesn’t need to do heavy computation, we can select a cost-efficient instance type. Note that all model training happens in the Gretel Cloud (in this example) or in an Amazon EKS cluster (if you opt for a Gretel Hybrid deployment).
Step 4 - Add notebook to the SageMaker instance
Once the notebook instance is created, we start the instance and open JupyterLab to add the notebook.
The notebook will run the following steps:
- Install the gretel-client.
- Add the code snippet to retrieve the Gretel API key from Secrets Manager.
- Fetch all the uploaded CSV files from the restricted access S3 bucket.
- Use Gretel Transform to remove all the PII in the data.
- Train a Gretel Synthetic model on the de-identified data files.
- Create a Synthetic Data Quality Report.
- Generate synthetic data using the synthetic model.
- Upload the data quality report and synthetic data to the destination S3 bucket.
The notebook for this blog post can be found here. In our workflow, we renamed the notebook to gretel-model-run.ipynb.
Step 5 - Create notebook Lifecycle configuration
Next, within SageMaker, we’ll create a Lifecycle configuration that allows us to run and automatically stop the notebook. The notebook instance is stopped after the notebook has been idle for a specified number of seconds.
The script to use is the following:
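As a representative on-start sketch (assumptions: the notebook is saved as gretel-model-run.ipynb under the default SageMaker home directory, and AWS’s published sample auto-stop helper is reused; the download URL was valid at the time of writing):

```shell
#!/bin/bash
# On-start lifecycle script: execute the notebook, then auto-stop when idle.
set -e

NOTEBOOK="/home/ec2-user/SageMaker/gretel-model-run.ipynb"
IDLE_TIME=600   # stop the instance after 10 minutes of inactivity

# Execute the notebook in place once the instance is up.
nohup sudo -u ec2-user -i <<EOF
source /home/ec2-user/anaconda3/bin/activate python3
jupyter nbconvert --to notebook --execute --inplace "$NOTEBOOK"
source /home/ec2-user/anaconda3/bin/deactivate
EOF

# Auto-stop: fetch AWS's sample idle checker and schedule it via cron.
wget -O /home/ec2-user/autostop.py \
  https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/master/scripts/auto-stop-idle/autostop.py
(crontab -l 2>/dev/null; echo "*/5 * * * * /usr/bin/python3 /home/ec2-user/autostop.py --time $IDLE_TIME --ignore-connections") | crontab -
```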
Once done, we need to attach the lifecycle configuration to the notebook instance. Note that if we want to make any modifications to the notebook, we’ll have to remove the lifecycle from the instance to prevent auto-stopping.
Step 6 - Create the AWS Lambda function
Now we create a Lambda function with the function name launch_gretel_nb.
We then add an S3-based event trigger.
The trigger configuration is set up so the S3 bucket gretel-source-data-bucket is the event source with “object create events” as the event types. For this example, we’ll limit file types to CSV.
Note: Gretel supports other flat data formats, such as JSON and Parquet, and we can always add separate triggers for those file formats. Alternatively, you could consider using Gretel Relational in your automated pipeline, with events triggered from Amazon DynamoDB or Amazon DocumentDB.
Once the Lambda function is configured, we add a few lines of Python code to instruct the function to start the SageMaker notebook instance that we named gretel-schedule-notebook.
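The handler can be as small as the following sketch, assuming the instance name gretel-schedule-notebook and a Lambda execution role allowed to call `sagemaker:DescribeNotebookInstance` and `sagemaker:StartNotebookInstance`:

```python
INSTANCE_NAME = "gretel-schedule-notebook"

def should_start(status):
    """Only start the instance when it is fully stopped; S3 events that
    arrive while it is already running or pending are ignored."""
    return status == "Stopped"

def lambda_handler(event, context):
    import boto3  # bundled with the Lambda Python runtime
    sm = boto3.client("sagemaker")
    status = sm.describe_notebook_instance(
        NotebookInstanceName=INSTANCE_NAME)["NotebookInstanceStatus"]
    if should_start(status):
        sm.start_notebook_instance(NotebookInstanceName=INSTANCE_NAME)
    return {"statusCode": 200, "instanceStatus": status}
```

Checking the status first keeps the function idempotent when a bulk upload fires many object-create events in quick succession.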
Step 7 - Attach the right IAM policies to the SageMaker role
We want to make sure that the SageMaker role has access rights to our S3 buckets and the Secret that holds the Gretel API key.
Therefore, we'll create the following policies in IAM and add them to the SageMaker role:
- S3GretelPolicy: to allow read/write access to the S3 source and destination buckets.
- SecretsManagerGretelAPIKey: to allow access to the secret holding the Gretel API key.
Go to IAM, select Policies, and click “Create policy”. For each policy, add the corresponding JSON policy document and give it the name listed above.
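As an illustration, S3GretelPolicy could look like the following sketch (bucket names from Step 1; tighten the actions further if your workflow allows):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GretelS3Access",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::gretel-source-data-bucket",
        "arn:aws:s3:::gretel-source-data-bucket/*",
        "arn:aws:s3:::gretel-destination-data-bucket",
        "arn:aws:s3:::gretel-destination-data-bucket/*"
      ]
    }
  ]
}
```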
Secrets Manager Gretel API Key
Here, we’ll need to get the ARN of the secret from Secrets Manager.
Next, we’ll add it to the policy.
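A minimal SecretsManagerGretelAPIKey sketch; the Resource value below is a placeholder to be replaced with the ARN copied from Secrets Manager:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GretelSecretRead",
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "<your-secret-ARN-from-Secrets-Manager>"
    }
  ]
}
```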
Once these policies are created, we’ll add them to the SageMaker role:
Finally, the role will look as follows:
Now we’re ready to test out our workflow!
Step 8 - Monitor the workflow in Amazon CloudWatch
In Amazon CloudWatch, we can retrieve the logs to see each time the Lambda function was triggered. We can also check the logs produced by the SageMaker notebook, which can be helpful for debugging.
In this post, we walked you through the process of setting up an automated workflow that creates synthetic data using Gretel together with AWS services such as SageMaker, Lambda, and Secrets Manager. Once configured, the process runs fully autonomously and can be monitored via Amazon CloudWatch. If you have more stringent security requirements, these processes can be run in a Gretel Hybrid environment to ensure data never leaves your control. Either setup enables the generation of an unlimited amount of PII-free, high-quality synthetic data that can be safely shared across your organization and with external stakeholders.
About the authors
Maarten Van Segbroeck
Maarten Van Segbroeck is a Principal Applied Scientist at Gretel, where he helps customers adopt synthetic data solutions for their use cases. Prior to Gretel, he spent 5 years at Amazon, dividing his time between AWS and Amazon Alexa. Maarten holds a PhD in Electrical Engineering from the Catholic University of Leuven and worked as a postdoctoral researcher at the University of California, Los Angeles.
Rumi
Rumi is a global thought leader in the fields of AI and machine learning, with 18 years of experience in the technology industry. With expertise in AWS cloud, solutions architecture, and hands-on software development, Rumi has consistently achieved remarkable outcomes. Beyond her technical prowess, she is an active blogger and video content creator, crafting original ML content for re:Invent and hosting engaging in-person sessions. As she leads a team of AI/ML Partner Solutions Architects at AWS, Rumi collaborates closely with top machine learning ISV partners, leveraging AWS ML services to elevate their products and spearheading strategic engagements.
Qiong (Jo) Zhang
Qiong (Jo) Zhang is a Senior Partner Solutions Architect at Amazon Web Services, specializing in AI/ML. She worked at Fujitsu as a Research Scientist from 2008 to 2021 and received her Ph.D. in Computer Science from The University of Texas at Dallas. She holds 30+ patents and has co-authored 100+ journal and conference papers. She is also a recipient of the Best Paper Award at IEEE NetSoft 2016, IEEE ICC 2011, ONDM 2010, and IEEE GLOBECOM 2005.