Improving AI Accuracy and Reducing Bias in Healthcare with Synthetic Data
Video description
How synthetic data is helping improve AI accuracy, accelerate development, and reduce the time to data for critical healthcare applications
Transcription
Speaker 1 (00:02):
Hi, my name is Za. I work in product marketing at Gretel.ai. I'm super excited to welcome you all here today as you join us in this livestream, where we explore AI applications in healthcare and how synthetic data is helping improve AI accuracy, accelerate development, and reduce the time to data for critical healthcare applications. I'm humbled to be joined here today by two of my Gretel colleagues. Unfortunately, we're not joined today by Nicole Pang, senior product manager; she caught the flu, but we hope she'll be well soon. But I'm equally excited to welcome Yamini Kale, head of Product Growth, and Kendrick Boyd, principal machine learning scientist. So I'm going to hand it over to them and we're going to get started.
Speaker 2 (00:54):
Thanks, Za. Appreciate it. As Za said, my name is Yamini Kale. I'm on the product team here at Gretel. I just wanted to briefly start off by talking about some of the opportunities and challenges faced by AI today in the healthcare industry. Next slide please. And one more. Thank you. So everyone's probably heard all the buzz about AI. It's got the potential to really revolutionize healthcare. It can be used for things like disease diagnosis, drug discovery, predictive medicine, and medical image analysis. There was recently an article by the World Economic Forum that said a future marked by generative AI technology will usher healthcare into a new era of innovation. Morgan Stanley predicts the market opportunity here is $6 trillion over the next few years. Next slide please.
(02:10)
So one of the issues is that while AI can be really life changing, it does require large amounts of data. Everyone's probably heard of OpenAI and GPT-3 and GPT-4. GPT-3 was trained with 175 billion parameters; GPT-4 reportedly has on the order of 1 trillion parameters. So this is all big data, and even that general-purpose data can be really challenging and expensive to collect. Now, when you consider domain-specific data, it becomes that much harder to get. Healthcare data has its own challenges because it's highly regulated and contains a lot of private data: PHI, PII, you have HIPAA regulations, you have the risk of that data being exposed. So you have to be doubly careful. Data can be sparse, so you can't always collect clinical data at scale; research studies don't typically output large datasets. And then the data can also be imbalanced. For example, you might have insufficient diversity in your dataset, or you might not have enough positive data points to successfully train a model. This can actually create some serious issues in AI. Next slide please.
(03:49)
Here we've got a couple of studies that came out recently that highlight the issues with having poor training data for AI models. In one example here on the right, the model got really good at detecting malignant tumors, but it was then found that what it had actually done was build an association that all the images with measuring tapes in them were those of tumors. So it hadn't actually achieved diagnostic accuracy; it had just found a pattern that the researchers hadn't really thought about. This was true of the other example on the left as well. The positive pneumonia datasets all came from a particular hospital department, and the model learned to associate the X-rays from that department with a positive data point, which created false accuracy. Next slide, please.
(05:05)
You've all probably been seeing these articles recently. There has been a lot of data that has come out about racial bias being found in algorithms. Most of these cases are caused by lack of diversity in the dataset. You might not have enough examples of a particular age group or a particular gender or race, and when you train models on these datasets, you end up with those biases being magnified. Next slide please. So what is synthetic data and how can it help? Synthetic data is artificial data that simulates real-world data with all its distributions and correlations. It can also augment real data. As we've just established, this is a big problem in healthcare, where you just don't have enough data to train your model, and synthetic data can be used to augment the real data to reduce the occurrence of these biased AI models.
(06:11)
Next slide please, Za. Thank you. So what can you do with synthetic data? You can use synthetic data to de-identify PHI and PII. You can make the data differentially private so that it's not open to adversarial attacks and cannot be reverse engineered, so your data is really safe and secure. You can use it to balance biased datasets and boost minority classes. We actually have a great Python notebook that was released just recently; it's in our GitHub repository and shows an example of boosting minority classes. If anyone's interested, please send us an email and we'd be happy to send that to you. That's hi@gretel.ai. Moving on, you can use synthetic data to generate datasets that are large enough to train ML models and to generate data that's compliant with regulations. Next slide please, Za.
(07:18)
The benefits of synthetic data are manifold. You can use it for sharing, collaboration, and innovation. You can use it to generate infinite volumes and variations of your data. The real value actually is in the last-mile training, which is being able to fine-tune the models on your own domain-specific data, which can make them really relevant to your own use cases. Next slide please. So don't take our word for it: Gartner predicts that by 2030 synthetic data will completely overshadow real data, and we are poised now to see that really take off. That's it from me. I will hand it over to Kendrick to dive into the modalities and use cases and also provide some real-life examples.
Speaker 3 (08:30):
Thank you very much. And go ahead, next slide please. Yeah, so to start with, synthetic data can be created for lots and lots of different types of data. These range from tabular data that you might think of as a spreadsheet, to relational data where you have multiple tables with linked information in them. Certainly you can create synthetic images, text like visit summaries, and also time series from sensor readings of various types. All of these show up in healthcare datasets, EHRs and such. And you can combine these modalities and have synthetic data that creates EHR records that are also linked to X-rays from a hypothetical patient, where the correlations and relationships between those are all present.
(09:33)
Go ahead and go to the next slide please. So we can create all this synthetic data, but in addition, it's not just random uncorrelated values. We're able to preserve the same correlations, the same distributions that are in the real-world data. You can see some examples here of training and synthetic data properties; the training data is on the left side, the synthetic data is on the right side. There are certainly some trade-offs here: you can have a simpler, faster model to create your synthetic data, but that's going to have lower fidelity, and the quality of these relationships is not going to be preserved as well. But if there are use cases where you really need excellent preservation of those properties, then you can use a more complex, slower model that may take longer to train, longer to develop, and longer to generate data from, but gives you really high-quality synthetic data. So depending on your particular use case, you can make these trade-offs and also fine-tune the exact model you're using, and how you're using it, to get the best results depending on what the goals are. And next slide please.
(10:45)
And so just as a quick example of the sort of quality you can get from synthetic data, here we're taking real data and then synthetic data and comparing the performance on a downstream task that you might have as a use case. In this case we're training a classification model on the real data and then training another one on the synthetic data. We have a stroke dataset here plus several other tasks from other domains. And across these different examples, just using an off-the-shelf synthetic data model with default settings, the accuracy of a classification model trained on the synthetic data is within 2.6% on average of what you would get from the real-world data. And then again, you can further refine the way you're using the synthetic data for your use case to continue bumping that up, if having the very best data quality is important for your situation.
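To make that comparison concrete, here is a minimal sketch of the train-on-real versus train-on-synthetic evaluation, assuming numeric features and illustrative file and column names (not the actual stroke dataset used in the webinar):

```python
# Sketch: train one classifier on real data, another on synthetic data,
# and score both on the same held-out *real* test set.
# File names and the "stroke" target column are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

real = pd.read_csv("stroke_real.csv")            # real records (assumed numeric features)
synthetic = pd.read_csv("stroke_synthetic.csv")  # records sampled from a synthetic data model

target = "stroke"
train_real, test_real = train_test_split(
    real, test_size=0.2, random_state=0, stratify=real[target])

def fit_and_score(train_df: pd.DataFrame) -> float:
    """Train on the given frame, score on the held-out real test set."""
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(train_df.drop(columns=[target]), train_df[target])
    preds = model.predict(test_real.drop(columns=[target]))
    return accuracy_score(test_real[target], preds)

print("trained on real:", fit_and_score(train_real))
print("trained on synthetic:", fit_and_score(synthetic))
```

The key detail is that both models are scored on the same held-out real test set, so the gap between the two numbers reflects how well the synthetic data stands in for the real data.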
(11:49)
And so now we'll dive a little bit deeper into a particular example about reducing bias and improving ML accuracy. Here we'll be looking at augmenting the training data with synthetic data for a heart disease risk prediction task. Go ahead and move to the next slide please. Heart disease is obviously a major challenge in the world today, and being able to predict and identify who's at higher risk and who's at lower risk is critical to the ultimate health outcomes. One of the challenges that sometimes happens here is that we often have less data about certain demographics, certain underrepresented groups, and that often leads to poorer health outcomes as well, just because we aren't as familiar; we don't know the best way to treat or diagnose certain groups.
(12:55)
And so for this particular investigation, we're going to be looking just at male and female, and the dataset that we have here is only 32% female, even though the general population is 50% female, or I think actually a little bit higher than 50%. Obviously you can extend this to additional demographics, other situations beyond just the male/female dichotomy, but as an example here, that's what we're going to go with. So we don't have as much data, and, going back to the previous slide, we certainly could go and collect more data, but oftentimes that may be quite expensive or may not be feasible depending on the situation. And so the idea is that we can use synthetic data to augment this.
(13:53)
Yeah, so the question here is, what do we actually do? How do we go about doing this? In the real data that we have available, we have about 500 examples from male patients and about 200 examples from female patients, and they have various information about the patient: what sort of labs they might have had, tests they've had, something about their medical history, and then also the true risk of heart disease. And we trained a synthetic data model so that we can generate additional synthetic data that looks like this, that has the same properties, the same sort of shape, the same relationships.
(14:38)
And then we augment our dataset. We do this both by bringing up the proportion of female patients in the data that we're going to use to train our model and by increasing the overall size of the dataset, because classification models typically work better with more data. So we augmented our dataset so that we now have about 2,500 examples for male patients and 2,500 examples for female patients. And so what does this do for our model? We're now going to train a classification model, using several different algorithms, to predict the risk of heart disease. Go ahead and move to the next slide to see some results. We tried five different algorithms in this case; obviously there are tons and tons of different models that you could try out. And in four out of those five models, we see improved performance in terms of the accuracy of the heart disease risk prediction. When we use the augmented dataset that includes synthetic data, we get about 90.3% mean accuracy across these models, whereas if we're only using the real data, it's 89.0%. So we see a modest 1.3 percentage point increase in performance. And so where is this increase coming from? What's going on? We dive a little bit deeper. We go to the next slide please.
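Here is a rough sketch of that augmentation-and-comparison step, assuming a tabular dataset with numerically encoded features; the file paths, column names, and per-group target size are illustrative stand-ins rather than the exact setup described above:

```python
# Sketch: top up each sex to a target size with synthetic rows, then compare
# several classifiers trained on real-only vs augmented data on a real test set.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

real_train = pd.read_csv("heart_train.csv")      # ~500 male / ~200 female records (assumed)
synthetic = pd.read_csv("heart_synthetic.csv")   # sampled from the synthetic data model
test = pd.read_csv("heart_test.csv")             # held-out real data

TARGET, GROUP, PER_GROUP = "heart_disease", "sex", 2500  # assumed column names

def top_up(real_df: pd.DataFrame, synth_df: pd.DataFrame) -> pd.DataFrame:
    """Add synthetic rows per group until each group reaches PER_GROUP examples."""
    parts = []
    for value, real_grp in real_df.groupby(GROUP):
        needed = max(0, PER_GROUP - len(real_grp))
        synth_grp = synth_df[synth_df[GROUP] == value].sample(
            needed, replace=True, random_state=0)
        parts.append(pd.concat([real_grp, synth_grp]))
    return pd.concat(parts, ignore_index=True)

augmented = top_up(real_train, synthetic)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(random_state=0),
    "gboost": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    for label, train_df in [("real only", real_train), ("augmented", augmented)]:
        model.fit(train_df.drop(columns=[TARGET]), train_df[TARGET])
        acc = accuracy_score(test[TARGET], model.predict(test.drop(columns=[TARGET])))
        print(f"{name:8s} {label:10s} accuracy={acc:.3f}")
```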
(16:17)
And so in this situation, it wasn't just that we increased our accuracy on the female patients because there were more of them in the training dataset with the synthetic data; there was an improvement across both groups. The accuracy of our predictions for the female patients improved by about 4.5%, but there was still a 1.3% improvement for the male patients. Some of that is probably coming from having a slightly larger dataset, since we'd augmented our dataset with more synthetic data for both males and females. So we can see here that we can get benefits for both subgroups. Again, this is a very concrete, specific example, but you can think about expanding it to other situations, other demographics, or doing a similar sort of thing if you have particularly rare disease outcomes and you want to boost the presence of those types of examples in the dataset you're ultimately going to train on.
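Computing those per-subgroup numbers is straightforward; here is a small sketch of a subgroup accuracy breakdown, assuming the `test` frame, predictions, and column names from the previous sketch:

```python
# Sketch: accuracy computed separately per subgroup, so improvements for
# female patients are visible even when overall accuracy moves only slightly.
import pandas as pd

def accuracy_by_group(test_df: pd.DataFrame, preds,
                      target: str = "heart_disease", group: str = "sex") -> pd.Series:
    out = test_df.assign(pred=preds)
    return out.groupby(group).apply(lambda g: (g["pred"] == g[target]).mean())

# e.g. accuracy_by_group(test, model.predict(test.drop(columns=["heart_disease"])))
```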
(17:33)
So with that, we're going to move on to a quick tour of some other real example use cases of synthetic data in healthcare and biomedical. For any of these, we're certainly happy to dive deeper if there are questions or follow-ups; you can chat with us through that hi@gretel.ai email address that Yamini mentioned earlier, and we'll show it at the end as well if you have any follow-up questions. So first off, let's talk a little bit about some synthetic genomic data. This is some joint work we did with Illumina.
(18:18)
Yeah, so this is from some mouse genome data. There's a paper published around these mouse genomes looking at a GWAS study, which looks at the correlations between the traits we observe in the mice and whether there are any genes that are highly associated with those and can be predictive. The top plot here, a Manhattan plot I believe, is the main result from that GWAS study in the real paper. The bottom plot is the result where we created some synthetic mouse genomes and then ran the same style of analysis. We can see some of those same places along the genome popping up as being highly correlated with the particular phenotype of interest. Now certainly sharing mouse genomes is not such a big deal, and we're probably okay training on mouse genomes directly. But obviously the hope here is that the same types of approaches, these same benefits, could also be possible with human genome data, so that you can safely train a model on synthetic human genome data without the model training needing direct access to the individual human genomes of real people with PII and other associated data.
(20:01)
Yeah. So let's see here. Do we want to answer some questions right now before we go further into the examples? Sorry to put you on the spot there, Za.
Speaker 1 (20:19):
Sure, that'd be great. So we have a couple of questions around overfitting protection. Can we speak a little bit about how we prevent the models from overfitting on any one sample?
Speaker 3 (20:33):
Yeah, I think there are a number of different approaches here. There's certainly a range of techniques from the machine learning community to avoid over-training. This includes having a validation set so you can check how performance on held-out data evolves, and there are ways to still utilize that when you are training a synthetic data model as well. You can certainly use early stopping, where you only train until certain criteria are reached, so you can try to detect when overfitting is happening. Then there are other specific modeling techniques like regularization or dropout, techniques for how the model works, that can help prevent overfitting, and you can automate these processes during training so that you don't necessarily have to be aware of them. And then the final thing is that we can use quality scores, ways to measure how well the synthetic data I'm producing matches the real data. You can compare at different points during training what the performance is, either qualitatively, by looking at the data depending on what it is you're working with, or quantitatively, using statistical measures, and use that to identify when or if overfitting might be happening.
(22:05)
And then the last part of the question was about specific examples. I think a really important technique here, in terms of making sure that one single example, one single patient, doesn't influence the synthetic data model and thus your synthetic data too much, is differential privacy. We may have more webinars on differential privacy, but in a nutshell, differential privacy is a technique such that the model you train is not too dependent on any given example, and there are actually mathematical guarantees to ensure that cannot happen. So that's a really powerful technique if you are in a situation where you are particularly concerned about not becoming too dependent on any single example, about not reproducing the sensitive information that might be in your training data, making sure you don't reproduce that in your synthetic data. If there are more questions around differential privacy, I would be happy to chat more about that as well. Is there another question?
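As one concrete piece of the answer above, here is a minimal, framework-agnostic sketch of validation-based early stopping; `train_step` and `val_loss` are hypothetical callables standing in for whatever training loop and held-out metric you actually use:

```python
import copy

def train_with_early_stopping(model, train_step, val_loss, max_epochs=100, patience=5):
    """Stop training once the validation metric stops improving for `patience` epochs.

    train_step() runs one epoch of training; val_loss() scores held-out data.
    Both are hypothetical callables supplied by the caller.
    """
    best_loss, best_model, stale = float("inf"), None, 0
    for _ in range(max_epochs):
        train_step()
        loss = val_loss()
        if loss < best_loss:
            best_loss, best_model, stale = loss, copy.deepcopy(model), 0  # keep best checkpoint
        else:
            stale += 1
            if stale >= patience:  # no improvement for `patience` epochs in a row
                break
    return best_model if best_model is not None else model
```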
Speaker 1 (23:23):
Yeah, if you want to take one more right now before we take more of the Q&A. There's one about hidden patterns in the data. So humans might miss patterns in the data, and how do we ensure that a machine will find those patterns that are hidden? And the question that follows it is, wouldn't synthetic data cloud those hidden patterns?
Speaker 3 (23:44):
So yes, I think that's an excellent question, and in many ways it's at the root of a lot of the more philosophical discussions around machine learning, I think.
Speaker 4 (24:05):
Let's see.
Speaker 3 (24:11):
So I'm going to give sort of an off-the-cuff answer here. This would certainly be something we'd be happy to chat about in more detail.
(24:19)
I guess the first thing is that machine learning models in general are going to find all sorts of correlations and exploit them as best they can. Sometimes they identify things that we as humans are expecting: they identify that if you smoked, you have a much higher risk of getting lung cancer. That makes sense; we certainly expect the machine learning model to pick that up. But other times they may pick up something like some of the examples that Yamini gave earlier, where the machine learning model identified from the training data it was given that if there was a ruler in the image, then it was far, far more likely to be malignant, or something like that. And so there are certainly
(25:21)
challenges around which correlations and relationships we as humans care about or think are important, versus which correlations and relationships a machine learning model may identify and reproduce, and whether those are important. Do they match what a person expects from the model, and are they the right relationships to be going after? All of this is to say that I think you should always still be looking at your data. You need to be aware of what the training data has, whether there are biases or oddities in that training data. And similarly, you should be looking at your synthetic data, in addition to summaries and reports about its overall distributions: do some exploratory data analysis on your synthetic data just like you would if you were doing data science or machine learning on real data, to understand whether there are things that are surprising or strange, and if there are, follow up on those as best you can. So that was a long-winded and wandering explanation. Hopefully that somewhat answered your question, but we'd be happy to chat further if you have follow-up questions there.
(26:41)
Alright, so with that we'll dive back into a couple more examples. So first off, let's see here,
(26:56)
here's a concrete example with a relational database, specifically EHR-type data. This is some work with a top research hospital in the US. They have existing research data collected from their cancer patients, stored across multiple tables with linkages between them, and that database contains sensitive information. Because the information in those tables is sensitive, access to it is limited. In particular, there are teams within the clinics that are interested in developing AI applications around cancer treatment, and it was very difficult for them to get access.
(27:56)
And so we were able to use the relational tool that Gretel provides to synthesize a synthetic version of this clinical database, which preserves both the statistical properties and relationships within each table and the cross-table relationships, the primary and foreign key relationships and such, that exist in a relational database. And then we were able to train models on that synthetic data. This greatly sped up the process of working on these cancer applications: we can train the models directly on the synthetic data much more readily, and the models don't have access to the direct sensitive information at all. So this is able to both protect privacy and provide sufficient data quality that the models we train on the synthetic data are still valuable and useful. And with that, we'll move on to the next slide please.
(29:03)
And so here we just have some additional examples; I won't go into a lot of detail here. Two more examples of EHR data work, where we're able to produce synthetic versions of EHR data and do interesting things with that. And then another example where we're using synthetic time series data: here we have sensor readings from stomach muscles, and we're using synthetic time series versions of those to augment data for training an additional machine learning model used for predictions. And we'll go to the next slide here, and again, some more examples working with EEG data and also some postnatal blood transfusions. We'd be happy to chat in more detail around these, but it sounds like we have lots and lots of questions. So I'll go ahead and turn it over to Yamini, who can quickly go through and discuss some of the practicalities of how you actually use synthetic data and how you set that up.
Speaker 2 (30:13):
Thanks Kendrick. I did want to take a few seconds here to address a question in the comments about how the actual synthesis process works. Our Gretel platform is free to use, so you can just go to gretel.ai and sign up for it. We provide a pretty generous credit allowance every month, and we've set it up with some use cases and sample datasets that really show you how easy it is. We make a number of models available; Kendrick touched on these earlier: we've got tabular, relational, image synthetics, time series. So basically you can take your dataset and we make the process extremely easy. We will attempt to automatically detect the type of data from the structure. We never upload any of the data, so it always stays private, but we try to understand the structure and recommend a model.
(31:33)
You just go through the steps, you click okay, and then you go ahead and train the model on your training dataset. Once it's trained, you can use it to generate unlimited amounts of synthetic data. So that's basically how the process works. What I'm going to talk about here is how you can operationalize that. This works great for one-offs, right? When you're going through the evaluation process, when you're trying to find the model that's going to be most accurate for your use case, when you're trying to tweak accuracy, utility, and privacy settings, that's when it's really useful to have this more experimental mode. But once you are ready to put it into production, you have to start thinking about a few other things. Next slide please, Za. Data tends to be the big bottleneck. These days all companies are generating or capturing massive amounts of data, but they're not able to share it or operationalize it.
(32:49)
After speaking with over 16,000 developers, we have learned that productionizing, gathering, and cleaning that data can take more than 50% of the average project time. It can run into months, if not a year or more. Next slide please. Here's a look at a typical AI lifecycle. Our CPO, Alex Watson, has actually written a very good blog post about this, so I'll let you check it out in detail, but you pretty much have to collect the data, prep the data, do EDA, annotate it, then feed it into the model for training; then you have to evaluate the model, deploy the model, and monitor, retrain, and so on. So it creates this cycle, and a lot of things go into each of those phases. Next slide please. So how can Gretel help?
(33:58)
We make it easy to automate a lot of these processes. We have validators that ensure high-quality output and avoid hallucinations; you've all probably heard that that is a concern with some of the very large language models that are popular today. We have a battle-tested platform with more than 30,000 users, and we've trained more than a million models, so a lot of experience has gone into this. Somebody asked in the comments whether synthetic data is mature or if it's still very much an emerging market. While it is early days, Gretel has been around for over four years now, and we've built up a lot of experience that we've put to use improving the product and making it really production ready. We have privacy-enhancing technologies, and we've got automated deployment and scaling: we have a Kubernetes-based platform, so you can deploy in your own hybrid cloud if you prefer, which keeps your data completely private. We have single-click data source and output connectors. We also provide reporting on accuracy and privacy, and we have tools to help with evaluating downstream metrics. These are all things that you wouldn't get with the popular open source models. Next slide please.
(35:40)
In addition to all of those, we've got a friendly web interface that I mentioned earlier, so you can see the reports there. We have a wizard-like interface that will help you generate the synthetic data. We have very powerful APIs, so it's not all UI; you can use our APIs to do all the synthesizing and report generation that you like. And then again, like I said, you can deploy as software-as-a-service or in your own hybrid cloud. Next slide please. Real quick, we've recently launched a few new features that we're excited to share, and we also have new features coming soon. One of those is workflows, the ability to chain multiple models together and fit them into your existing MLOps pipelines. We're also launching data connectors, which will help with this functionality a lot. We recently launched a tabular model with enhanced levels of differential privacy, and we're about to launch a new version of our GPT model along with text metrics, accuracy metrics; this is something that nobody else is doing really well right now. And finally, we're adding accuracy scores for downstream ML models and for ML model evaluation as well.
(37:21)
And that's it from me. Back to you.
Speaker 1 (37:30):
Yes, I'm just bringing everyone back here. Kendrick, Yamini, thank you. Thank you everyone for asking a bunch of questions and patiently listening through. This is the time for Q&A. We're going to get to our questions live in chat; we also have a number of questions that came in from our form. Looking at the time, realistically we probably have time for a few of the questions, but it's a very good problem that we have so many. So we'll be sure to follow up with everyone over email with the top 10 FAQs and their answers. If anyone's question goes unanswered after this and isn't covered in the FAQ either, we also have a Discord that you can reach out to anytime and ask questions there. But we're going to do our best to get through as many as we can here. So without further ado, I'm going to get to some of the live questions because they're coming in right now. There's one question asking about dirty data, and looking at dirty data when training a synthetic data model. I will pull up that question right now as well, but Kendrick, if you guys want to take that one?
Speaker 3 (38:38):
Yeah, yeah, that's an excellent question. Certainly data cleanliness and consistency is a big challenge when working with machine learning models. In general, I think the approach you should take is that if you're expecting your downstream use case, whatever it is, to be seeing missing data, nulls, outliers, that sort of thing, then you should try to make sure that the synthetic data model is producing those as well, so that's part of the analysis you would be doing on your synthetic data. At the most basic level, that means not doing that type of cleaning on the training data you pass to your synthetic data model. And then certainly confirming: does the synthetic data I produce from this look like the actual real data that I see in my live deployment? Do I see some outliers sometimes in the synthetic data? Do I see some missing values, that sort of thing? If you're not seeing that, then you can go back and tweak the exact setup of the synthetic model so that you produce it. So really you usually want the synthetic data you're producing to match what you'd expect to see in your real data as well, and there are various techniques and things you can do to make sure it gets closer if it's not matching right now.
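One way to run that confirmation is a simple side-by-side check of missing-value and outlier rates; a minimal sketch, assuming the real and synthetic datasets are pandas DataFrames with matching columns:

```python
# Sketch: compare null rates (and a crude outlier rate) between real and
# synthetic data, to confirm the synthetic data reproduces the "dirtiness"
# you expect to see downstream.
import pandas as pd

def dirtiness_report(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for col in real.columns:
        row = {"column": col,
               "null_rate_real": real[col].isna().mean(),
               "null_rate_synth": synthetic[col].isna().mean()}
        if pd.api.types.is_numeric_dtype(real[col]):
            lo, hi = real[col].quantile([0.01, 0.99])  # tails of the real distribution
            row["outlier_rate_real"] = ((real[col] < lo) | (real[col] > hi)).mean()
            row["outlier_rate_synth"] = ((synthetic[col] < lo) | (synthetic[col] > hi)).mean()
        rows.append(row)
    return pd.DataFrame(rows)
```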
Speaker 1 (40:12):
Cool. And there are two questions here about data bias; I'm going to clump them together because I think they might have somewhat similar answers. There's one question about what kind of protections keep the synthetic data that's generated to boost minority classes from having similar biases, given that it's AI generated. And another question adds on to that and asks what kind of data we used such that it reduced bias. So I think maybe an explanation of conditional data generation, or how that works, would answer these questions.
Speaker 3 (40:50):
So in addition to just being able to sample synthetic data from them, with many of these models, probably most of these models, you can do what's called conditional generation. That's where you say: I want to generate a patient record, but I want some of the values to be certain things that I've specified. For example, I want to generate a record for a female patient who's between the ages of 55 and 65, and then the synthetic model will complete the rest of that record with information about other demographics or heart disease status and that sort of thing. With that, you can be very targeted in what types of data you want to generate: if you want to boost how often you see people from different locations, different ages, or with different related potential complications; if you want to increase how many people you see in your training data with diabetes or something. You have lots of control around that with a synthetic data model that can do this conditional generation.
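In code, conditional generation usually looks like handing the model a partial record to complete; here is an illustrative sketch in which `synthesizer.sample_conditioned` is a hypothetical placeholder for whatever conditional-sampling call your synthesizer exposes, not a specific library API:

```python
import pandas as pd

# Fix ("seed") the columns we want to control; the model fills in the rest.
seed = pd.DataFrame({
    "sex": ["female"] * 1000,
    "age_band": ["55-65"] * 1000,
})

# Hypothetical call: complete the remaining columns (labs, history, heart
# disease status, ...) consistently with the fixed values above.
synthetic_subset = synthesizer.sample_conditioned(seed)
```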
Speaker 1 (42:14):
Cool. And then I
Speaker 3 (42:15):
Think I answered some of the question, but maybe not all of it. I'm happy if there's a follow-up za that I should do, please lemme know. Sure, yeah,
Speaker 1 (42:24):
Sounds good. Taking the next question here: I think we've covered part of this, but it might be helpful to reiterate; those of us steeped in synthetic data day and night sometimes take this for granted. The question is, what is the basic workflow of generating synthetic data? And can you illustrate it with an example, say by using the synthetic data you generated for patients with heart attacks?
Speaker 3 (42:50):
Right. So I think the basic workflow is that we start with some real data, collected from somewhere. In the particular example we walked through here, we were using a dataset that was up on Kaggle from a competition there, but you'd certainly use your own internal data or other data that you can source. So that's the real training data you're going to start with, and just like any sort of machine learning, you should do a little bit of exploratory data analysis on it: understand where it came from and start thinking about whether there are any potential biases or things in there you need to be aware of; you might want to try to change or reduce those biases during the process. So you have your training data, and you train a synthetic data model. These days that's typically a deep learning model, but instead of this model outputting true or false for whether someone is, say, at high risk or likely to have a heart attack, the synthetic data model's output is an example itself.
(43:58)
It outputs the data itself. So you train this model and now you can keep generating: you can sample from this model to generate as much data as you want, and you can do conditional sampling from it to change the proportions, the ratios of different variables if you need. So that's the basic workflow. And then, as in many of these situations, you would often do some iteration. With the first model you train, the synthetic data you get back may not match what you need for your use case, so you'd probably iterate a little bit on which model you're using and the exact parameters, and do some tweaks there to really fit the synthetic data you're getting to the situation you have and how you want to use it.
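Condensed into code, that workflow is essentially fit, sample, condition, and iterate; a minimal sketch in which `SyntheticModel` and its methods are hypothetical stand-ins for whatever synthesizer you use, and the file name is illustrative:

```python
import pandas as pd

train_df = pd.read_csv("heart_real.csv")   # 1. real training data (explore it first)

model = SyntheticModel()                   # hypothetical synthesizer class
model.fit(train_df)                        # 2. train the synthetic data model

synthetic = model.sample(n=5_000)          # 3. sample as much data as you need

balanced = model.sample_conditioned(       # 4. optionally condition to rebalance
    pd.DataFrame({"sex": ["female"] * 2_000}))

# 5. evaluate against the real data, then iterate on the model and its settings
```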
Speaker 1 (44:50):
Cool. There's another question here; I'm just going to take one from the form because I think it showed up four or five times. There are questions about evaluating and validating the accuracy of synthetic data. So can we speak a little bit about what that workflow looks like and how one goes about it?
Speaker 3 (45:08):
Yes. So for evaluation, the main generic approach is to look at the statistical properties of your synthetic data and see how those align with the statistical properties of the training data, the real data that you built the model from. Gretel has an SQS report that automates this process when you're using our product. It produces an overall score of how well it's doing, but also more detailed charts and breakdowns of different places where your synthetic data has better correlations, better properties, and other places where it does worse. That can be a first step to direct you on how you might iterate if you want to improve the quality of the data. But ultimately I think the final arbiter of the quality of your synthetic data is how well it does for your downstream use case. So that's always the best way to evaluate the synthetic data: is it useful for you?
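For a hand-rolled version of the statistical side of that evaluation, you can compare per-column distributions and pairwise correlations yourself; a small sketch, assuming two pandas DataFrames with the same schema (this is not the actual report implementation):

```python
# Sketch: per-column distribution distance plus a correlation-matrix gap
# between real and synthetic data.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def compare(real: pd.DataFrame, synthetic: pd.DataFrame):
    numeric = real.select_dtypes("number").columns
    # 1. Kolmogorov-Smirnov distance per numeric column (0 = identical distributions)
    ks = {c: ks_2samp(real[c].dropna(), synthetic[c].dropna()).statistic for c in numeric}
    # 2. Mean absolute difference between the two correlation matrices
    corr_gap = (real[numeric].corr() - synthetic[numeric].corr()).abs().to_numpy()
    return pd.Series(ks).sort_values(ascending=False), float(np.nanmean(corr_gap))
```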
Speaker 2 (46:18):
Hey Za, do we have time for one more question? I've seen this one come up pretty often as well, and I was hoping Kendrick could answer it. Sorry, you're muted, but I'm guessing you're telling me to go ahead. The question is: how do you know if you have enough data to synthesize? We often have customers who come in and say, we don't know if we have enough data. What do we do in those instances?
Speaker 3 (46:51):
So I think there are a couple of different avenues or directions here. In the most basic use case, where you're training a dedicated model from your data and just your data, there are certainly some lower bounds. If you only have 10 patients, it's going to be really hard for a model to figure out what the distributions are and what this should look like. As a rough guideline, maybe you want at least a hundred examples or something, but it's also going to depend on the exact model you're using. A simpler model in general doesn't need as much data to be effective, but it's also not going to produce as high-quality data in your result.
(47:40)
But there are also certainly situations where you only have 10 examples, or one or two examples, or maybe you don't have any data right now; you just know the schema, you know what you want the data to look like. In this area, I think there's a lot of really exciting new research and new progress, and some new previews from Gretel that we're planning to come out with soon: a large model that's been pre-trained on lots and lots of different datasets, so it knows what data looks like in general. If you have these types of column names, this is the type of data you usually have, these are the correlations that tend to be there; certain names tend to come from certain geographic locations, or in the US different names tend to be associated with different generations, that sort of thing.
(48:38)
And so with this very large pre-trained or foundation model, you can prompt it with maybe just a few examples, or just with a schema or a textual description, and it will be able to produce a dataset for you. There's again going to be some iteration here to tweak the data to match what you're really looking for, what you need. Once you have that from the foundation model, you can either use that synthetic data directly or use it to train a more dedicated model focused on this particular synthetic data task. So that's an area and a use case that we're really excited about, and if it's interesting to you, please reach out through Discord or through the hi@gretel.ai email, because we'd really love to have a chance to chat with you about it and for you to give feedback on the initial preview that we're planning to come out with this summer.
Speaker 1 (49:36):
Cool. There's another question here about using Gretel to generate synthetic data from air-gapped data sources.
Speaker 3 (49:48):
I think the short answer here is that no, Gretel does not operate in an air-gapped situation with our main product. You could certainly use the code that we have in our source-available repository; the core code for most of our models is available in the gretelai/gretel-synthetics GitHub repository, I think. But we do have a hybrid option where the data never leaves your virtual machines, your machines and such, though the machine still needs to be able to connect out to the internet to download the container to run your model, and for metadata and billing purposes and such. In this hybrid setup, the data never leaves your containers, your control, and similarly, the quality reports, for example, don't get sent back; that's all internal in your own setup. So we hope that for the vast majority of cases, that hybrid setup is sufficient, and we don't necessarily need to go completely to an air-gapped setup.
Speaker 1 (51:16):
And a few folks asked about this: how can relationships and correlations in real data be represented in synthetic data?
Speaker 3 (51:34):
So this is a great question, maybe one to have a conversation around. The way that I'm interpreting it is: the relationships and correlations from the real data, the synthetic data model trains on that real data and encodes that information in the weights, the parameters of the model. Then when you sample from the synthetic data model, it's able to reproduce those relationships and correlations in the synthetic data. So it's going to be something like: lung cancer tends to be correlated with having smoked previously; the synthetic data model learns that and is going to produce a dataset where, if you've smoked, maybe 60% of the time in your EHR you eventually get lung cancer. So that relationship is preserved, but the actual individuals don't come across. If in the real data there was a 69-year-old who lived in North Carolina and they smoked and then had lung cancer, you're not going to have that exact set of values for an example in your synthetic data, but the overall properties and correlations of the synthetic data remain the same as the real data.
(52:59)
But for the individual examples, there's no mapping or way to map back and say, oh, this synthetic data example came from this particular example, this particular patient, in the real data. There's no direct mapping like that. But yeah, happy to chat more offline around this if that didn't answer your question.
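A related sanity check you can run yourself is to confirm that no synthetic row is an exact copy of a real training row; a small sketch, assuming two pandas DataFrames with shared columns (this is only a basic check, not by itself a privacy guarantee):

```python
# Sketch: count synthetic rows that exactly match a real training row
# across all shared columns.
import pandas as pd

def exact_match_count(real: pd.DataFrame, synthetic: pd.DataFrame) -> int:
    cols = [c for c in real.columns if c in synthetic.columns]
    matches = synthetic[cols].merge(real[cols].drop_duplicates(), how="inner")
    return len(matches)  # ideally 0, or at least very small
```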
Speaker 1 (53:24):
And this question also came in a few times in the registration form. Folks are asking about how to prevent sensitive data from leaking into the model during training, certainly topical. I know we spoke a little bit about validators, but Kendrick, Yamini, could you guys provide an answer to this one?
Speaker 3 (53:46):
Yeah, so I think there are several different techniques that can be used here, and you might use different ones, or a couple of them, depending on the situation and the exact requirements that you have. So in no particular order: one of the most powerful techniques here is differential privacy, where if you have a differentially private model, there are mathematical guarantees about how much identifiable information from a particular example can leak into the model, if you will, and then ultimately into whatever synthetic data or other outputs you produce from the synthetic data model. I think that's in many ways the most powerful; it has the best guarantees and is becoming a de facto standard in the area. But there are also a lot of other techniques you can use depending on your requirements. There's doing some de-identification, anonymization, or pseudonymization before you train the synthetic data model, so that the synthetic data model never even sees the actual real names; you've already removed or changed those before producing the synthetic data model. And then there are other techniques that we're really excited about, such as using reinforcement learning with privacy feedback to help guide the model training process to not reproduce sensitive information that was in the training data.
(55:36)
So I think those are some of the techniques, and there certainly are some others out there as well. Again, it's going to depend a lot on the exact use case and situation you're in: what kind of data are you concerned about leaking out through a synthetic data model? That will help guide what types of tools and approaches you might use to avoid it.
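For the de-identification step mentioned above, here is a minimal sketch of dropping direct identifiers and pseudonymizing a join key before the data ever reaches a synthetic data model; the column names are assumptions for illustration:

```python
# Sketch: remove direct identifiers and replace a patient ID with a salted,
# non-reversible hash before training any synthetic data model.
import hashlib
import pandas as pd

DIRECT_IDENTIFIERS = ["name", "ssn", "phone", "email"]  # assumed column names

def pseudonymize(df: pd.DataFrame, salt: str = "rotate-this-secret") -> pd.DataFrame:
    out = df.drop(columns=[c for c in DIRECT_IDENTIFIERS if c in df.columns])
    if "patient_id" in out.columns:  # keep a stable but non-reversible join key
        out["patient_id"] = out["patient_id"].astype(str).map(
            lambda v: hashlib.sha256((salt + v).encode()).hexdigest()[:16])
    return out
```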
Speaker 1 (56:06):
Okay, cool. So we'll take one last question here before we wrap up. This question came in a few times from the registration form as well. It's specifically about generating more data for clinical diagnoses, in situations where not enough data exists.
Speaker 3 (56:36):
Okay, so I think this is talking about having very sparse data: you may have lots and lots of patients, but any given diagnosis or any given disease is fairly rare. So what can you do there? First off, if you have, say, fewer than a thousand diagnoses that you're working with, even though each individual diagnosis is fairly rare, in general the models that we use still do quite well at that. They still pick up that these diagnoses exist, they learn about the correlations and relationships they have with other variables in your dataset, and then you can utilize conditional generation to generate synthetic data that maybe only has patients with a particular rare disease or something. You can tailor the proportions, the ratios of these diagnoses in your synthetic data, and use that to explore and fit whatever your use case is. And if you do get into situations where there are lots and lots of diagnoses, more than a thousand, there are some more sophisticated tricks we can do around how we encode and pass that information to the synthetic model to do better in those situations. I'd be happy to chat more if that's a situation that you're in.
Speaker 1 (58:16):
Perfect. I think we'll wrap up here. Thank you everyone for joining. Again, thank you Kendrick and Yamini for hosting this great session; I know I learned a ton. I'm dropping a link to our Discord community in the comments here. Our Discord community is the best way to access our internal resources and ask whatever questions you have in a safe space. You could try one of our channels, or you could message one of us directly, and everyone who works at Gretel has (gretel) next to their name, so that's how you know you're actually reaching a real Greteler. Again, thank you everyone for joining, and we look forward to seeing you in future webinars. Thanks again.