Synthetic Data, Real Privacy: Automating secure workflows with Gretel and Amazon SageMaker
Video description
Using Gretel and AWS SageMaker to build automated and secure synthetic data workflows
Transcription
Speaker 1 (00:02):
Hey everyone, thanks for joining us. Today we're going to talk about using SageMaker to build automated and secure synthetic data workflows. I'm Yani Kale and I'm on the product team at Gretel. Joining us are Maarten Van Segbroeck from Gretel and Qiong Zhang from AWS. We'll start off with some brief introductions, and then Joe will go over everything that you can do with SageMaker. Martin will then walk us through the Gretel workflow in SageMaker. We'll make sure to leave enough time for a Q&A session. Martin, over to you.
Speaker 2 (00:40):
I am Martin, principal scientist here at Gretel. I lead the customer success team, where we support customers in their adoption of synthetic data and in their workflows, and we explore how synthetic data can be used in their use cases.
Speaker 4 (01:01):
Hello everyone, good morning or good afternoon. This is Qiong Zhang; you can call me Joe. I'm a senior partner solutions architect from AWS specializing in AI/ML. Now we're going to talk about SageMaker and how SageMaker helps you automate machine learning workflows. Next slide.
(01:27)
As you know, when it comes to AI/ML, AWS offers a broad and deep set of services at three layers: AI services, Amazon SageMaker, and the machine learning frameworks and infrastructure. And I would like to highlight our recent generative AI offerings. Here we have Amazon Bedrock. It is the easiest way to build generative AI powered applications and leverage foundation models from third parties such as AI21 Labs, Anthropic, Stability AI, and Amazon. We have services with built-in generative AI such as Amazon CodeWhisperer. It is an AI coding companion that helps you build applications faster and more securely, free for individual use. And we have Amazon SageMaker and SageMaker JumpStart with the flexibility for all customers, and we have the most price-performant infrastructure for machine learning, AWS Trainium and AWS Inferentia, combined with EC2 UltraClusters and high-speed networking. Today, there are tens of thousands of customers using Amazon SageMaker, making it one of the fastest growing services in the history of AWS.
(02:50)
Next please. Amazon SageMaker is a comprehensive machine learning service enabling business analysts, data scientists, and machine learning operations engineers to build, train, and deploy machine learning models for any use case, regardless of machine learning expertise. Data scientists can easily prepare data and build, train, and deploy machine learning models with SageMaker Studio. Machine learning engineers can deploy and manage models at scale with SageMaker MLOps capabilities, and business analysts can make machine learning predictions using a visual interface with SageMaker Canvas. Next slide. Customers often tell us about some main challenges in machine learning, and SageMaker helps you overcome these challenges. One of the main challenges I would like to mention is that data needs to be labeled and prepared in large volumes before it can be used for training machine learning models. This requires massive data processing capabilities, and SageMaker Ground Truth and SageMaker Data Wrangler make it easy to label and prepare data for machine learning. Also, Gretel's generative AI models for synthetic data generation can improve data quality for machine learning and preserve data privacy. Another challenge that I would like to mention is, let's go back one slide, that
(04:24)
DevOps teams have to manually manage disparate environments for prototyping, development, testing, validation, and production, making it difficult to maintain consistency across these environments and ultimately delaying deployment of models to production. SageMaker provides MLOps capabilities including SageMaker Pipelines, SageMaker Projects, and SageMaker Model Registry to help you streamline the machine learning lifecycle, reducing the operational burden of managing machine learning models in production and accelerating time to value. Now, next slide. Here we show the MLOps features and capabilities in SageMaker. We use SageMaker Pipelines to automate the different steps of a machine learning workflow. At the data preparation step, we use the low-code SageMaker Data Wrangler or a SageMaker Processing job to transform data. We can use Feature Store to store, share, and manage machine learning model features for training and inference, and to promote feature reuse across machine learning applications. And we can use Gretel's products here to generate synthetic data for improving data quality and for preserving data privacy. During model training and tuning, we use SageMaker Experiments to track and find the best performing model, and use Model Registry to register the best model. We then deploy the model to a SageMaker endpoint and use Model Monitor to monitor model drift and data drift for models in production.
(06:05)
Now let's dive a little deeper into SageMaker Pipelines. Amazon SageMaker Pipelines is a machine learning CI/CD service; CI/CD stands for continuous integration and continuous delivery. SageMaker Pipelines reduces the months of coding required to manually stitch together different code packages to just a few hours, and with just a few clicks in SageMaker Pipelines you can create an automated machine learning workflow. Pipelines provides comprehensive steps including data preparation, transformation in Data Wrangler, algorithm setup, debugging, model optimization, and more. Pretty much anything you can do in SageMaker you can add to your workflow. Next, let me show you an example workflow for model building using SageMaker Pipelines. There are four steps in the pipeline: data processing, model training, model evaluation, and, if the model accuracy is greater than 90%, registering the model. The input dataset is stored in an Amazon S3 bucket. If training data is not available, for example due to privacy requirements, we can use Gretel's generative AI models to generate synthetic training data and then send it to our S3 bucket. S3 can then send an event notification to initiate execution of the pipeline. Later, Martin from Gretel will demonstrate how to use Gretel to generate synthetic data and send it to the S3 bucket.
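To make the slide concrete, here is a condensed sketch of what such a four-step pipeline can look like in the SageMaker Python SDK. This is illustrative rather than the exact pipeline from the webinar: the script names (preprocess.py, evaluate.py), the sklearn/XGBoost images, the bucket layout, and the 90% threshold encoded in the ConditionStep are all assumptions.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.conditions import ConditionGreaterThan
from sagemaker.workflow.functions import JsonGet
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.step_collections import RegisterModel
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

sess = sagemaker.Session()
role = sagemaker.get_execution_role()  # works inside SageMaker; pass a role ARN otherwise
bucket = sess.default_bucket()
region = sess.boto_region_name

sklearn_uri = sagemaker.image_uris.retrieve(
    "sklearn", region, version="1.2-1", instance_type="ml.m5.xlarge")
xgb_uri = sagemaker.image_uris.retrieve("xgboost", region, version="1.7-1")

processor = ScriptProcessor(image_uri=sklearn_uri, command=["python3"], role=role,
                            instance_count=1, instance_type="ml.m5.xlarge")

# Step 1: data processing on the CSV that landed in S3.
process = ProcessingStep(
    name="DataProcessing",
    processor=processor,
    code="preprocess.py",  # hypothetical script
    inputs=[ProcessingInput(source=f"s3://{bucket}/input/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
             ProcessingOutput(output_name="test", source="/opt/ml/processing/test")],
)

# Step 2: model training on the processed output.
estimator = Estimator(image_uri=xgb_uri, role=role, instance_count=1,
                      instance_type="ml.m5.xlarge", output_path=f"s3://{bucket}/model/")
train = TrainingStep(
    name="ModelTraining",
    estimator=estimator,
    inputs={"train": TrainingInput(
        s3_data=process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
        content_type="text/csv")},
)

# Step 3: evaluation; evaluate.py is assumed to write {"accuracy": <float>}
# to /opt/ml/processing/evaluation/evaluation.json.
report = PropertyFile(name="EvalReport", output_name="evaluation", path="evaluation.json")
evaluate = ProcessingStep(
    name="ModelEvaluation",
    processor=processor,
    code="evaluate.py",  # hypothetical script
    inputs=[ProcessingInput(source=train.properties.ModelArtifacts.S3ModelArtifacts,
                            destination="/opt/ml/processing/model"),
            ProcessingInput(
                source=process.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
                destination="/opt/ml/processing/test")],
    outputs=[ProcessingOutput(output_name="evaluation",
                              source="/opt/ml/processing/evaluation")],
    property_files=[report],
)

# Step 4: register the model only when accuracy exceeds the 90% gate.
register = RegisterModel(
    name="RegisterModel", estimator=estimator,
    model_data=train.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"], response_types=["text/csv"],
    inference_instances=["ml.m5.xlarge"], transform_instances=["ml.m5.xlarge"],
    model_package_group_name="synthetic-data-models")
gate = ConditionStep(
    name="AccuracyGate",
    conditions=[ConditionGreaterThan(
        left=JsonGet(step_name=evaluate.name, property_file=report, json_path="accuracy"),
        right=0.9)],
    if_steps=[register], else_steps=[],
)

pipeline = Pipeline(name="ModelBuildPipeline", steps=[process, train, evaluate, gate])
pipeline.upsert(role_arn=role)
pipeline.start()
```

The ConditionStep is what implements the "register only if accuracy is greater than 90%" gate from the slide; everything else is standard step wiring.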
(07:43)
So how do you create automated SageMaker pipelines? You first create steps by writing scripts for each step, and then you connect the steps by defining a pipeline. Once you define the pipeline, you can execute it, and you can execute multiple instances of the pipeline concurrently. And there are different ways to trigger pipeline executions. One example, as we mentioned before: we can trigger a pipeline execution when a new training dataset is uploaded to an S3 bucket. Or you can initiate a pipeline execution regularly, per week or per month, by using Amazon CloudWatch Events and Amazon EventBridge. Next slide. Here we summarize the features in SageMaker. With these features, Amazon SageMaker helps you prepare data and build, train, and deploy machine learning models in production for any use case. Thank you. Now I'll hand over to Martin to talk about how to integrate Gretel with AWS services.
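As a sketch of the scheduled-trigger option Joe mentions, an EventBridge rule can start the pipeline on a fixed cadence. The pipeline and role ARNs below are placeholders; the role must be allowed to call sagemaker:StartPipelineExecution.

```python
import boto3

events = boto3.client("events")

# A rule that fires once a week (a cron(...) expression works here too).
events.put_rule(
    Name="weekly-model-build",
    ScheduleExpression="rate(7 days)",
    State="ENABLED",
)

# Point the rule at the SageMaker pipeline; EventBridge assumes RoleArn to start it.
events.put_targets(
    Rule="weekly-model-build",
    Targets=[{
        "Id": "sagemaker-pipeline",
        "Arn": "arn:aws:sagemaker:us-east-1:123456789012:pipeline/ModelBuildPipeline",
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeSageMakerRole",
        "SageMakerPipelineParameters": {"PipelineParameterList": []},
    }],
)
```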
Speaker 5 (09:19):
Hey Martin, I think you might need to unmute.
Speaker 2 (09:22):
Oh, I'm sorry. That's okay. Thanks everyone, thanks Joe. So before we go into the demo, I'd like to talk a little bit about synthetic data and what we are doing at Gretel. You want to go to the next slide? Thank you. What is synthetic data? Synthetic data is annotated information that computer simulations or algorithms generate as an alternative to real-world data. So synthetic data is not real data; it's derived from real data and has the same statistical properties as that data. Synthetic data is typically generated from a model that is either directly trained on the real data or, in the case of a foundation model for instance, fine-tuned on it. Next slide.
(10:06)
Gretel is a synthetic data company. There are three points I'd like to make here about the benefits of synthetic data. First of all, generally speaking, synthetic data is an answer to making data safe and accessible. If you're able to share data with others, you can collaborate better, innovate faster, and ultimately improve productivity. The concept of synthetic data applies to a wide variety of projects and business use cases. The way we generate synthetic data at Gretel is by first training a synthetic data model on your data to learn the statistical properties of the underlying data. Once you have that model, you can not only generate as many records as you want, even more than are present in the production data, but you can also conditionally generate from the model. What that means is that you can boost a minority class or create various data sets in parallel covering multiple scenarios, scenarios that you might not have seen before in your production data but can anticipate seeing in the future.
(11:09)
We can then use those data sets in our lower dev environments, to make sure that things are not breaking, for instance. Thirdly, synthetic data also plays a key role in what we believe is last mile training, especially in the current trend of moving to generative AI such as large language models or stable diffusion models. What this means is that these generative models are trained on public, open domain data, but to make them really work for your use cases, you need to fine-tune them on your custom production and highly sensitive data. And the safest way to do that is to use synthetic data derived from that production data, allowing you to have a safe barrier between production data on one side and the machine learning model on the other side, and preventing any leakage of PII into that model. Next slide please. What does it mean for you to collaborate faster, innovate faster, productionize faster? It means that Gretel basically compresses the time for you by reducing the various bottlenecks that exist in workflows that typically run in production environments. The fact that you don't have to go through the process of collecting the data yourself, which can be quite expensive, and anonymizing the data, already brings a huge win to your business, and Gretel can help with that.
(12:38)
Next slide please. Here I just want to show that we're not overselling things. This is a study done by Gartner in 2021, basically predicting that by 2030 the majority of the data that is used for machine learning model training will be synthetic data. And at Gretel, we already see that paradigm shift happening by talking to our customers. In my opinion this comes from two reasons. First of all, continuous improvement in synthetic data modeling is closing the gap between synthetic data on one side and real data on the other side, while preserving that privacy guarantee. Secondly, there's an increasing awareness among business leaders that synthetic data is a safe alternative to real data, and in some cases it's your only choice. Think about the example of healthcare, where patients don't want to share their data for machine learning model training; synthetic data still offers a solution that allows you to train machine learning models.
(13:41)
Next slide please. So how do we do this at Gretel? In a nutshell, we are offering a fully comprehensive and multimodal synthetic data platform that supports tabular data, text data, relational datasets, images, and time series, and all the models that we are using under the hood use the latest state-of-the-art technology, going from GAN-based models to foundation models in the generative AI space. On top of that, we are also building additional features like validators that check the data quality of the synthetic data, to guarantee, for instance, that we're not sensitive to hallucinations in the synthetic data. We're also building privacy filters and privacy techniques into those models. One example is our Tabular DP model, which uses differential privacy and guarantees that the synthetic data is completely safe and secure to use. We'll use the Tabular DP model in the demo that I will give later on. One last point before going to the demo is that Gretel is available both in the Gretel Cloud, meaning that all the compute and auto scaling is taken care of by Gretel, and in a hybrid deployment where everything runs in your cloud environment; we support all the major clouds: Azure, Google Cloud, and AWS. This ensures that none of your data will actually leave your environment.
(15:19)
So now going to the demo, first I'll show you a quick diagram of our workflow. So here you go. We're going to show you how we can leverage the different AWS services to build a secure workflow, using SageMaker on the one hand and Gretel on the other. First of all, we start by creating a source bucket. That source bucket is where your production data and your sensitive data will be stored, so you want to give that bucket very restrictive access rights within your organization. On the other hand, you have a destination bucket where all the synthetic data and additional artifacts of the synthetic data will go.
(16:14)
That bucket you can give more permissive access rights, so that you can share it with data scientists, machine learning engineers, or machine learning scientists to use that data. Gretel is available through an SDK, so we can actually call Gretel in a SageMaker notebook. The SageMaker notebook that we are going to leverage will take the data from the S3 source bucket, train a synthetic model on that data, then generate synthetic records from that model, and store them in the destination bucket. Now, to make everything automated, we added a Lambda function that gets triggered as soon as a data file gets uploaded to the source bucket. We specify in this case a CSV file, so only files with the .csv extension will trigger the Lambda function, and the Lambda function will then execute and start the SageMaker notebook instance.
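A minimal sketch of what that Lambda handler might look like, assuming the notebook instance is named gretel-schedule-notebook (the name used later in the demo):

```python
import boto3

sm = boto3.client("sagemaker")
NOTEBOOK_NAME = "gretel-schedule-notebook"  # assumption: matches the demo setup

def lambda_handler(event, context):
    # Log which .csv object(s) triggered us (suffix filtering happens in S3).
    for record in event.get("Records", []):
        print("New object uploaded:", record["s3"]["object"]["key"])

    status = sm.describe_notebook_instance(
        NotebookInstanceName=NOTEBOOK_NAME)["NotebookInstanceStatus"]

    # Only start the instance if it is actually stopped; a running instance
    # is already (or still) processing.
    if status == "Stopped":
        sm.start_notebook_instance(NotebookInstanceName=NOTEBOOK_NAME)

    return {"notebook": NOTEBOOK_NAME, "previous_status": status}
```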
(17:11)
That upload can be done manually, as I will show later, or it can be done by an automated process. Any time there is a CSV file stored or uploaded in the source bucket, this workflow will kick off. To make this work, we also have to attach a lifecycle configuration to the notebook, which allows you to start the notebook and also stop it: when all the processing is done, the notebook goes into an idle state, and the configuration just spins down the instance on which the notebook is running. Gretel also makes use of an API key, and Amazon actually has a very neat solution for dealing with API keys. Rather than manually passing your API key or hard coding it, you can store the API key in AWS Secrets Manager and then give specific permissions to the SageMaker role to access it. So nobody has to see that API key; it just has to be configured once.
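A rough sketch of wiring up those two pieces: storing the Gretel API key in Secrets Manager and registering an idle auto-stop lifecycle configuration. The secret name, the config name, and the shell payload are assumptions; AWS publishes a fuller "auto-stop-idle" lifecycle sample that the cron line below stands in for.

```python
import base64

import boto3

# 1) Store the Gretel API key as a key-value secret (never hard coded).
boto3.client("secretsmanager").create_secret(
    Name="gretel-api-key",  # name is an assumption
    SecretString='{"gretel_api_key": "<your Gretel API key>"}',
)

# 2) Register a lifecycle configuration whose on-start hook schedules an idle
#    check every 5 minutes; autostop.py (stops the instance after ~10 idle
#    minutes) stands in for AWS's published auto-stop-idle sample script.
on_start = """#!/bin/bash
echo "*/5 * * * * /usr/bin/python3 /home/ec2-user/autostop.py --time 600" | crontab -
"""
boto3.client("sagemaker").create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName="gretel-auto-stop",
    OnStart=[{"Content": base64.b64encode(on_start.encode()).decode()}],
)
```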
(18:08)
Once this is set up, the entire workflow will run completely autonomously, without the need for manual intervention. However, you can always check the logs, and see when the Lambda function is triggered, by looking at Amazon CloudWatch. I'll now show you this in a live demo, but I also want to mention that we will release a blog post tomorrow where all the steps are nicely described, along with any code or scripts used in this workflow. The notebook and policies that we set up will also be available, so you'll be able to reproduce everything that I'm going to walk you through.
(18:59)
So we'll go over to the screen share here. This is where we start. We start in S3, where we create, like I said, a source data bucket and a destination bucket. We can customize the permissions on those buckets. As I said before, the source data bucket should have very restrictive access rights, because that's where the production data will be stored. The destination bucket is the bucket that will be shared with your broader organization, internal teams, or whatnot. Next we set up a SageMaker notebook, which I called gretel-schedule-notebook. It's currently in a stopped state. If we go to this notebook here, we see that I attached a lifecycle configuration to it that will spin the instance up and down; it checks for a 10-minute idle state, all configurable, and then takes the notebook down when all the processing is done. Next we also have to set up a Lambda function. This is how we do it in AWS: we specify launch-gretel-notebook as the name of the Lambda function, and it gets triggered by an S3 event. As you can see here, the event type is object creation, the upload of a file for instance, and it only gets triggered when a file has the .csv suffix.
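For reference, the same S3-to-Lambda trigger with the .csv suffix filter can be configured in code rather than in the console; a sketch with placeholder bucket and function ARNs (the Lambda also needs a resource policy allowing S3 to invoke it, set separately via lambda add_permission):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="my-source-data-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "Id": "launch-gretel-notebook-on-csv",
            "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:launch-gretel-notebook",
            "Events": ["s3:ObjectCreated:*"],
            # Only .csv uploads fire the function, matching the demo.
            "Filter": {"Key": {"FilterRules": [{"Name": "suffix", "Value": ".csv"}]}},
        }]
    },
)
```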
(20:33)
The next step that I'd like to show is that we want to give the Amazon SageMaker notebook permissions. So we go to AWS IAM, and we want to give it permission to access only the two S3 buckets, so no access to other S3 buckets, just the destination and the source data bucket. And then also another policy for the secret where the API key is stored. So we don't give it access to the entire Secrets Manager, just access to the specific Gretel API key. These are the policies we need to attach to the SageMaker role. The secret where we actually store the API key is created in Secrets Manager; it's a key-value pair that can then neatly be retrieved in the SageMaker notebook. So when we go to the notebook here, the notebook is currently in a stopped state, but I actually created the same notebook in SageMaker Studio Lab and I'm showing it here.
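A sketch of what those two scoped policies might look like when attached to the SageMaker role with boto3; all ARNs, bucket names, and policy names are placeholders:

```python
import json

import boto3

iam = boto3.client("iam")

# Access to exactly the two buckets, nothing else.
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-source-data-bucket", "arn:aws:s3:::my-source-data-bucket/*",
            "arn:aws:s3:::my-destination-bucket", "arn:aws:s3:::my-destination-bucket/*",
        ],
    }],
}

# Access to exactly one secret, not Secrets Manager at large.
secret_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "secretsmanager:GetSecretValue",
        "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:gretel-api-key-*",
    }],
}

for name, doc in [("scoped-s3-access", s3_policy), ("gretel-secret-access", secret_policy)]:
    iam.put_role_policy(RoleName="SageMakerNotebookRole",
                        PolicyName=name, PolicyDocument=json.dumps(doc))
```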
(21:40)
Like I said, I will share this, but I will quickly walk through it. We start by installing the Gretel client here and importing some functions. We then grab the files that were uploaded to the S3 bucket and retrieve the data. We check if a file has already been processed or not, and then we go to the step of running Gretel Transform and Synthetics. Before we do that, we need to configure the session. So here we actually get the secret from AWS Secrets Manager, which gives us the Gretel API key. So no hard coding is required here; we just call AWS Secrets Manager to get our API key. Then we first transform the data and then synthesize data. Transform is a step that is needed to de-identify the data: it removes all the PII available in your data by detecting it and then redacting or replacing it.
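A minimal sketch of that session-configuration step, assuming the secret is the key-value pair named gretel-api-key with a gretel_api_key field (both names are assumptions; the blog-post notebook is authoritative):

```python
import json

import boto3
from gretel_client import configure_session

# Fetch the API key from Secrets Manager instead of hard coding it anywhere.
secret = boto3.client("secretsmanager").get_secret_value(SecretId="gretel-api-key")
api_key = json.loads(secret["SecretString"])["gretel_api_key"]

# Configure the Gretel client session; validate=True checks the key up front.
configure_session(api_key=api_key, validate=True)
```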
(22:42)
You want to make sure you do this step before training a synthetic data model if you want to fully ensure that no PII gets leaked into your synthetic data. And then we train a synthetic data model. In this case we use our Tabular differential privacy model. It has an epsilon of one, which gives a strong privacy guarantee here. First we train the model and then we generate synthetic data from it. Once that's all done, we store the generated synthetic data in the destination bucket. We also store a report: Gretel offers a synthetic data quality report, which I will show you in a second. It'll also be stored in the destination bucket, together with a summary of all the synthetic data quality scores.
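A rough sketch of the train-and-generate step using the Gretel Python client. The blueprint name, project name, record count, and S3 keys are assumptions based on Gretel's published config templates and client docs; the notebook released with the blog post is the authoritative version.

```python
import urllib.request

import boto3
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project
from gretel_client.projects.models import read_model_config

DEST_BUCKET = "my-destination-bucket"  # placeholder

project = create_or_get_unique_project(name="sagemaker-hospital-records")

# Load Gretel's Tabular DP blueprint (blueprint name is an assumption; the demo
# uses an epsilon of 1, so adjust the epsilon in the config if your blueprint's
# default differs).
config = read_model_config("synthetics/tabular-differential-privacy")

# Train on the de-identified output of the Transform step.
model = project.create_model_obj(model_config=config, data_source="deidentified.csv")
model.submit_cloud()
poll(model)  # blocks until training completes

# Generate synthetic records from the trained model.
rh = model.create_record_handler_obj(params={"num_records": 5000})
rh.submit_cloud()
poll(rh)

# Copy the synthetic data and the quality report to the destination bucket.
s3 = boto3.client("s3")
urllib.request.urlretrieve(rh.get_artifact_link("data"), "/tmp/synthetic.csv")
s3.upload_file("/tmp/synthetic.csv", DEST_BUCKET, "synthetic/week8_synthetic.csv")
urllib.request.urlretrieve(model.get_artifact_link("report"), "/tmp/report.html")
s3.upload_file("/tmp/report.html", DEST_BUCKET, "reports/week8_report.html")
```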
(23:39)
So let's go to our source data bucket and I'll show you how it's done. I've already processed hospital records, in this case from week six and week seven of 2023. So now let's add another file. Like I said, this can be done manually, or an automated process can do it. As soon as this gets uploaded to the S3 bucket, there we go, it's now here, we should go to our Gretel notebook, and it immediately goes into a pending state. It was stopped before, and the pending state means the pipeline is now kicked off. It'll take a couple of minutes to provision the instance and then run first Transform, then Synthetics on it. That will take, I think, almost 20 to 30 minutes, so we're not going to see the results for this file in this session, but like I said, I already ran this for week six and week seven of data.
(24:56)
So if we go to our Gretel console, where you can also follow how things have been processed, you can see the transform function is there and the synthetic data has been stored. These are two projects here that Gretel generates. It first creates a model; you can see all the activity here. This was the de-identified dataset. Gretel provides all the logs here, so you can follow along with how the Tabular DP model has been trained on your data. And once that is done, you can see a synthetic data quality score in a report. The report that Gretel generates looks something like this. In this case we have a great score of 93 and excellent privacy. We also provide deeper detail behind those scores: we can see for each of the columns in the file whether or not the quality is good.
(25:58)
We can see data correlations between the different columns, we see the deep structure of the data by means of principal component analysis, and we also see how the distributions for each of these columns compare to the original source data sets. So that is the workflow. Just one more thing: you can follow this entire process by going to CloudWatch as well. Everything that is running under the hood right now can also be followed live in the log files. So just to summarize: it's a completely automated workflow that doesn't need any manual intervention. If you want to change anything in the notebook, you obviously can, if you want to make some process changes. But once you have set this up, the only thing you have to do is upload the file, and then synthetic data is automatically generated and ends up in the destination bucket. And here you can see that we have the synthetic data and the report getting stored there, and that data can then safely be shared with other members of your organization. That is the demo. Over to you, Yani.
Speaker 1 (27:24):
Thanks Martin and Joe, that was super interesting. Quick note before we go to Q&A: the use case we saw today requires training data to be available, but sometimes there's no data available at all for certain events. We've been working for the past several months on a tabular language model that will address these needs. If this is a constraint that you're familiar with, check out the demo using the link or the QR code, and you can also sign up for early access there. I also wanted to say that Martin's demo walked through one example, but you can do various things in addition; we also support different data types such as JSON and Parquet. So have a look around; you can sign up for Gretel, and we have a generous free plan. Over to Q&A. Za, if you could show us the first question. I think that was for Joe, but Joe already answered it in the comments thread. Joe, do you want to give us a quick explanation again of what you answered for Arthur?
Speaker 4 (29:02):
Okay, yeah. Regarding SageMaker notebooks, or SageMaker: it's an integrated machine learning development tool, and you can do pretty much any machine learning workload on it, and you can do data preparation using the different services available. You can just use it like a Jupyter notebook, that's the basics. Then you can use Data Wrangler for data preparation, and you can use Feature Store to store the features that you have already manipulated and prepared. You can do training, or even distributed training, and SageMaker can automatically launch a distributed training cluster on the cloud for you, so you do not need to dive deep into how to build the infrastructure for distributed training. And also for inference, SageMaker can help you manage your inference endpoint at any scale. So the relation between Gretel and SageMaker is that with Gretel you can generate synthetic data, and then you can use the dataset from Gretel to do your regular machine learning workloads on SageMaker.
Speaker 1 (30:19):
Thank you for that explanation, Joe. Next question: given your S3 CSV-uploaded event trigger, what happens if your training data consists of multiple CSVs? Wouldn't the SageMaker pipeline get executed redundantly? Martin, this one I think is for you.
Speaker 2 (30:39):
So currently you can do a bulk upload of the CSVs. If you have multiple CSV files, you can upload them together. It will not do any redundant computation here, because we are also checking whether a file has been processed or not. I'd have to see the question again for the last part, if you want to show it quickly.
Speaker 2 (31:14):
No, so it wouldn't be run redundantly, and we will share the notebook with you so you can verify on your side that everything is taken care of on that part.
Speaker 1 (31:34):
Question for you again, Martin, briefly, how are the quality metrics computed?
Speaker 2 (31:39):
Yeah, so we are actually computing three types of quality metrics in our report. One has to do with field correlation stability: seeing how the correlations between columns in the synthetic data compare to the correlations in the original data. Second is the deep structure stability: we do a PCA analysis on the real data and on the synthetic data, and we compute distance metrics between the PCA components. And lastly there is field distribution stability, where we measure how the histograms for each of those fields match against each other. For more details, I would suggest reading our Gretel docs or reaching out offline, where we can go into the deeper mathematical concepts if that's what you want to hear. But that's a high-level overview of how we compute those.
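As a toy illustration of those three ideas (this is not Gretel's implementation, just a back-of-the-envelope version using pandas, scikit-learn, and SciPy on two dataframes with matching columns):

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import jensenshannon
from sklearn.decomposition import PCA

real = pd.read_csv("real.csv")
synth = pd.read_csv("synthetic.csv")
num_cols = real.select_dtypes("number").columns

# 1) Field correlation stability: average gap between correlation matrices.
corr_gap = (real[num_cols].corr() - synth[num_cols].corr()).abs().mean().mean()

# 2) Deep structure: project both datasets onto PCA axes fit on the real data
#    and compare where their centers land.
pca = PCA(n_components=2).fit(real[num_cols].dropna())
structure_gap = np.linalg.norm(
    pca.transform(real[num_cols].dropna()).mean(axis=0)
    - pca.transform(synth[num_cols].dropna()).mean(axis=0))

# 3) Field distribution stability: per-column histogram distance.
def hist_distance(a: pd.Series, b: pd.Series, bins: int = 20) -> float:
    edges = np.histogram_bin_edges(pd.concat([a, b]).dropna(), bins=bins)
    pa, _ = np.histogram(a.dropna(), bins=edges, density=True)
    pb, _ = np.histogram(b.dropna(), bins=edges, density=True)
    return jensenshannon(pa, pb)  # normalizes the inputs internally

dist_gap = {c: hist_distance(real[c], synth[c]) for c in num_cols}
print(corr_gap, structure_gap, dist_gap)
```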
Speaker 1 (32:51):
We also have some links that we can share that dive into this more. Feel free to check out our blog; we've got a lot of content there that dives into how the reports are generated and the quality metrics. Next question, for Joe: does SageMaker handle the scaling of notebook resources dynamically, and how do you assess performance?
Speaker 4 (33:15):
So when you create a SageMaker notebook instance, you need to select a compute instance type based on what you want to do in the notebook. Actually, a best practice when using a SageMaker notebook is to use the notebook to orchestrate other AWS services for machine learning training and inference. You can see from Martin's demo, he uses a SageMaker notebook to orchestrate a Lambda function and to orchestrate Gretel Cloud synthetic data generation. As another example, from a notebook you can launch SageMaker distributed training with GPU instances, and then SageMaker can automatically stop the GPU instances once training is finished, so you only pay for what you use. Or you can use the notebook to launch SageMaker endpoints for inference. SageMaker endpoints support automatic scaling, which means that when the inference workload increases, auto scaling brings more instances online, and when the inference workload decreases, auto scaling removes the unnecessary instances that were provisioned, so you can save compute costs there.
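For reference, a sketch of attaching that endpoint auto scaling via the Application Auto Scaling API; the endpoint and variant names, capacities, and the invocations target are placeholders:

```python
import boto3

aas = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

# Register the endpoint variant as a scalable target (1 to 4 instances).
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target tracking: keep ~100 invocations per instance per minute.
aas.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"},
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```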
Speaker 1 (34:29):
Great, thank you. Thank you Joe. Next question, Martin, what is synthetic data well suited for and what are its limitations?
Speaker 2 (34:39):
Yeah, that's a good question. So first of all, synthetic data accelerates your development; it's very well suited for that. If you want to share data, collaborate better, innovate faster, productionize faster, and ultimately monetize faster, synthetic data helps. You can use synthetic data to establish the privacy and fairness of your data. You can use it to augment your production or real data with synthetic data to, like I said, boost minority classes or create scenarios that have not been seen in the production data. So if you don't want to replace your production or real data, you can augment that data with synthetic data sets. In terms of limitations: synthetic data speeds up the research pipeline, but in some cases it may not be a full replacement for your real data. So for any final model that's going to be live in production, we suggest you evaluate it, fine-tune it, and assess it against the real data.
(35:44)
Take the example of a machine learning production model that you want to train on customer data, but you have, let's say, a 30-day data policy in place. You want to build the model with the highest accuracy, and let's say you can responsibly use real data from the last 30 days; then you would use that data, but you can also augment it with synthetic data to enrich the last 30 days and create scenarios that have not been seen. You don't have access to the data from the previous months, but you can use the synthetic version of that data. So you can use real data to get the highest accuracy and augment it further with synthetic data. But also note that although in some cases there might be a utility loss between real and synthetic data, in some cases you don't have an option: the only way you can train a machine learning model on your production data is through synthetic data, because it gives you that safe barrier, and that's something we sometimes see with healthcare customers. I hope that answered the question. Next question
Speaker 1 (36:58):
is again for Martin: what's the benefit of using Gretel with SageMaker versus doing my own, or I guess both of you can answer this one, versus doing my own data augmentation to create synthetic data on SageMaker? So basically, why not do everything in SageMaker versus using Gretel in SageMaker?
Speaker 2 (37:18):
Yeah, I think I can take that one. So there are open source libraries available to create synthetic data or resample your data, and you can build your own code to create synthetic data. But at Gretel, synthetic data is what we actually do. There's a lot of development going on in synthetic data; training synthetic data models and generating synthetic data is a discipline by itself, and we are taking care of that for you, firstly to make sure you don't have to reinvent the wheel, and also to make sure that the synthetic data you generate is safe. You can resample your data set, but that doesn't mean it is safe synthetic data. So that's where Gretel comes in, with all the expertise we have and with our privacy experts, to create models that not only give you high quality but also use technologies like differential privacy to generate those models and data sets in a responsible way. So I would suggest that you try out different things, but also use Gretel and compare your results with Gretel, and use tools to compare the privacy and the accuracy, because that's a trade-off. And lastly, you can sign up with Gretel for free, meaning that you can test it out and see for yourself how things compare to your own data augmentation scripts.
Speaker 1 (39:12):
I think there was a follow-up question on ensuring the quality of the synthetic data.
Speaker 2 (39:21):
Okay: does Gretel have any way to ensure the quality of synthetic data, meaning how do we know whether the synthetic data reflects reality instead of departing from it? This is what I already mentioned with the synthetic data quality report, where we compare real and synthetic data; that already gives you a sign of how good your synthetic data is in terms of statistical properties. Additionally, Gretel also offers the Evaluate report. Let's say your use case is to train a machine learning model on your synthetic data. With Evaluate, we can compare how a machine learning model trained on your real data would match a machine learning model trained on the synthetic data. So it also depends on what your use case is: if ML utility is something you want to test your data on, then having that comparison is handy, and that's what Gretel provides.
(40:32)
Additionally, we also have privacy filters and other validators in place. For instance, outlier filtering: we want to make sure that if there are outliers in the training data, those outliers are not reproduced in the synthetic data, because they are a risk for re-identification. Also similarity filters: we want to make sure the synthetic data is not matching the original training data, or some records of the original training data. So you have a tuning knob that you can use to choose the accuracy-versus-privacy trade-off that we also offer. I hope that answered the question.
Speaker 1 (41:31):
I think this is again, for Martin, is the synthetic data consistent on a data point or row basis? That is, do the data points make sense when looking at them individually?
Speaker 2 (41:44):
To some extent, yes and no. Let's say you have a certain identification number in your source data, and in your synthetic data you have the same row with that identification number; the values in all the other columns of that row can be different. It is better to compare the production data with the synthetic data at an aggregated level rather than on a row-by-row basis, because things can change. It can still be valid to do the row-by-row comparison to make sure that, for the example I gave, where one record has a certain ID in your production data and the same ID appears in the synthetic data, all the other columns are in the range of acceptable values, right? So that still makes sense, but in general you need to compare synthetic data with production data at the aggregate level. The best way to explain it is that you can generate an unlimited amount of data from your synthetic model. If you have a thousand records in your production data and you create a million records of synthetic data, it's hard to compare record by record, because you can create larger volumes with synthetic data.
Speaker 1 (43:22):
Great, thank you. This next question is for Joe: can SageMaker automate hyperparameter tuning?
Speaker 4 (43:33):
The answer is yes. Actually, there are two SageMaker features that can automate hyperparameter tuning. One is SageMaker Automatic Model Tuning, and the other is SageMaker Autopilot. Both features are able to handle machine learning problems, for example binary or multiclass classification, based on your data and algorithms, and also tune hyperparameters. In addition, Autopilot covers the extra step of data pre-processing for tuning. Thank you.
Speaker 1 (44:10):
Thanks Joe. Do we have time, Za, for one last question? Yep, looks like we have a last one from Arthur: can we use Gretel Evaluate to evaluate synthetic data not generated using Gretel? Gretel's quality control algorithms, are they usable separately from the data generation?
Speaker 2 (44:42):
Yes, great question. So we have our synthetic data quality report tool and we have our Evaluate tool for ML utility. You can call these as standalone libraries, and you can compare any two data sets with each other, whether or not they were generated by Gretel or by your own tools. So it's actually a good way to compare how Gretel performs against what you have been using before.
Speaker 1 (45:13):
Yeah, and I also dropped a comment about using both Evaluate and Benchmark. We can share links in a follow-up email. I think we're at time now. Thank you everyone for joining. As Martin mentioned, you can sign up for Gretel and use it for free, and please also check out SageMaker. Joe, any last thoughts before we leave?
Speaker 4 (45:51):
So it has been a great webinar and thank you very much for inviting me here.
Speaker 1 (45:57):
Thank you for joining us. That was great. Thank you Martin and Joe, and thank you all for joining.
Speaker 2 (46:03):
Thanks everyone.
Speaker 4 (46:04):
Thank you.