Workshop: Generating Synthetic Data for Healthcare & Life Sciences
At Gretel, when it comes to data, we care a lot about ethical, equitable, and fair practices. Gretel's CPO Alex Watson gave a workshop to a data science working group at Emory University on how you can address these practices, and how to leverage tools from Gretel to create statistically accurate synthetic data for health and life sciences research. You can watch the video or read the full transcript below, and then start creating your own synthetic data for free with Gretel.
Hey, my name is Alex Watson. I am a co-founder at Gretel AI. Today we're going to talk about synthetic data and how we see it being used by data scientists and software developers in the healthcare and life sciences spaces. We'll start with some of the use cases that bring life sciences or healthcare organizations to say, "I'm interested in checking out synthetic data." Actually, the use case on the bottom is probably the most popular one that we've seen: enabling faster access to data. Often, getting access to medical research data sets can take four to six months to go through the necessary approval processes and meet compliance obligations. The question that we ask with synthetic data is: can we create an artificial data set based on this real-world data that doesn't link back to any actual patients? And can we use that data set to enable a doctor, for example, to learn about a rare disease without learning about the patient?
So that's a really exciting use case that we see a fair amount of our customers diving in on. Other use cases are around making better data sets. Reducing bias and generating more samples from limited data sets are both around the same goal, where you have access to some data, but you don't have enough of the right data. So a question is: can we boost the representation of a minority class in a data set, or engineer some feature of the data, to create a more equitable, fair, or balanced data set? One of the examples we've worked on, and I'll link to it in the blog below, was with the UCI data science team on an openly published heart disease detection data set, where there was a two-to-one representation of males to females. We had really encouraging results. Essentially, we boosted the representation of female patients in that data, where it's prohibitively expensive to gather more real data, then ran some of the same top models from Kaggle across that data and noticed both a 6% increase in accuracy for female heart disease detection and a 2% increase in overall accuracy.
We've talked about some of the use cases around synthetic data. Now let's take a chance to talk about what synthetic data is. Here's a definition I like quite a lot, which actually came from the NVIDIA folks: synthetic data is annotated information that computer simulations or algorithms generate as an alternative to real-world data. So synthetic data is nothing new. It's been around since the advent of simulations and algorithms that create data given a certain representation or distribution they've seen. What has changed in the past couple of years is the massive advancement in deep learning. Today we'll talk about the techniques used in this industry, really focusing on one of the core algorithms that Gretel uses, which essentially trains a language model, or a sequential model, on a customer's very sensitive data set. It trains while imposing a certain level of privacy parameters and preventing the model's ability to memorize data it shouldn't.
What the model then outputs is a new artificial data set that has many of the same insights and distributions, in some cases even better accuracy than the original data (I'll talk about that in a minute), but that isn't based on any real-world person, object, or thing. What's new with synthetic data? It's not often I include a Gartner chart in a data science talk, but I think this is a really interesting one. What we see is a change in the way organizations are accessing and working with data: as we expect more and more from machine learning algorithms, and as we have more and more devices gathering data, there are a couple of big things changing. One: due to the sensitivity of data, it's harder to enable access.
There's a risk of a model memorizing data that it shouldn't. That's challenge number one. Number two, in the IoT space, one of the things we see is that less data is actually being sent back to the cloud. As technology advances, a really simple example to give is the Alexa or Google Assistant devices we have in our homes. When vendors are able to push machine learning models out to those devices, the devices can make decisions much more quickly. Self-driving cars, same thing: the decision's being made right there on the device. That means the data's not going back to the cloud. So if you're trying to train a model to become really good at detecting a certain type of scene, object, or voice utterance, at some point, to recreate all of the possibilities that might or might not exist in the world, a simulation or a synthetic version becomes a much more scalable approach. We're starting to see that both across our customers in how they're building with machine learning, and also reflected in the industries you see here.
So we can talk about synthetic data all day long. One of the things I think would make this fun and interactive is going through and actually writing code, building our own synthetic data model. Diving right in: this approach to synthetic data is really very similar to OpenAI's GPT models or other language models we've seen in the industry, transformers, things like that. We're going to go through a very simple example here using a recurrent neural network, an LSTM, that learns to recognize and replay the characters that exist in a text stream. Where OpenAI has built incredible models for processing natural language, Gretel and other synthetic data approaches have applied a very similar approach, but to learning either loosely structured or structured data. This could be CSV, it could be JSON, it could be something that you've pulled out of your data warehouse.
Essentially it learns the structure of a CSV. As we can see here, we've got three columns; it learns that as if it's its own language and then replays it. The first step is to take these arbitrary character sequences, which can be letters, commas, emojis, whatever you want, and map each one to some sort of integer representation. This is called tokenization in the NLP space, or vectorization. Here you can see the character "3" being mapped to the integer 8. We see a space as the third character, which has gotten the representation 1, and when another space appears later, we see that same representation reflected right there. So it's a very simple mapping we're doing here, of each character to an integer representation that we can feed into a network.
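The character-to-integer mapping described above can be sketched in a few lines of plain Python (a minimal illustration with hypothetical helper names, not Gretel's actual tokenizer):

```python
def vectorize(text):
    # Sorted set of unique characters gives a stable char -> index mapping
    vocab = sorted(set(text))
    char2idx = {ch: i for i, ch in enumerate(vocab)}
    idx2char = {i: ch for ch, i in char2idx.items()}
    # Encode every character, commas and newlines included, as an integer
    encoded = [char2idx[ch] for ch in text]
    return encoded, char2idx, idx2char

# A tiny CSV-like text stream stands in for the real training data
encoded, char2idx, idx2char = vectorize("63,36,1\n12,7,0")
# The mapping is reversible: decoding the integers restores the text
decoded = "".join(idx2char[i] for i in encoded)
```

Note that at this level the model sees no column structure at all; commas and newlines are just more characters whose positions it must learn.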
Step two: how does the model take these individual characters and turn them into something that can recreate language? The answer is surprisingly simple, actually. All we're doing is saying: given a certain set of input text that goes into the neural network, can you predict the next character? Networks are capable of doing this at such a scale that they can recreate entire languages, stories, things like that, just given input data. So here we've got a very simple function, and all it does is take an input sequence, such as the "So hot right now." example we see here, and return the next-character targets. You can see it chopped off the "S" at the beginning and introduced the period at the end. What the network is learning here is that, given the input "so hot right now", the most likely next character is going to be a period. If we do this enough, can we learn to replay an entire language?
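That input/target shift can be written as a one-liner, sketched here with the transcript's own example (the function name is illustrative):

```python
def split_input_target(sequence):
    # Input is everything except the last character; the target is the
    # same sequence shifted one position to the right, so at every step
    # the network is asked: "given the text so far, what comes next?"
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

inp, tgt = split_input_target("So hot right now.")
# inp drops the trailing period: "So hot right now"
# tgt drops the leading "S" and keeps the period: "o hot right now."
```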
Step three: here we're going to define our neural network model. We'll link the code at the bottom, so there's no need to cut and paste here; we've got a simple GitHub repository you can use to run through the entire example. Here we're going to use a popular machine learning framework called TensorFlow, and we're going to use Keras to make it simple to build the model. We'll go ahead and create a sequential model that we can use to predict the next character. We're doing some fancy things here: you see two LSTM layers being used, which helps the network learn more complex embeddings and representations in the data, and we see dropout layers, which help the model generalize better. You can read up on any of these. At the end of the day, we're configuring a neural network that's going to take our input text and learn to predict the next character.
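A minimal sketch of such a model using TensorFlow's Keras API is below; the layer sizes and dropout rate are illustrative placeholders, not Gretel's production configuration:

```python
import tensorflow as tf

def build_model(vocab_size, embedding_dim=64, rnn_units=128):
    # Character-level sequence model: embed each token id, stack two
    # LSTM layers (with dropout between them to help generalization),
    # and project logits over the vocabulary for next-char prediction.
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim),
        tf.keras.layers.LSTM(rnn_units, return_sequences=True),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.LSTM(rnn_units, return_sequences=True),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(vocab_size),  # one logit per character
    ])

model = build_model(vocab_size=65)
```

`return_sequences=True` matters here: we want a next-character prediction at every position in the input window, not just after the final character.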
After the model's been created, we run our training data through it. Here we can see we're training for 15 epochs. Really what we're doing is, for every possible next character, shifted one by one through our original data set, training the model to predict that next character. What we keep a close eye on here is the loss. We're looking at prediction loss: how well is the model predicting the next character, based on our cross-entropy loss? We see that going down, and when it stops improving, we know the model has reached a plateau and we can go ahead and start to use it for prediction.
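An end-to-end sketch of that training step, using a toy repeated CSV string in place of the real data set and a deliberately small model and epoch count (all values here are illustrative):

```python
import numpy as np
import tensorflow as tf

# Toy corpus standing in for the real training data
text = "age,sex,target\n63,1,1\n37,1,0\n" * 20
vocab = sorted(set(text))
char2idx = {ch: i for i, ch in enumerate(vocab)}
encoded = np.array([char2idx[ch] for ch in text])

# Build (input, target) windows shifted by one character
seq_len = 16
inputs = np.stack([encoded[i:i + seq_len]
                   for i in range(len(encoded) - seq_len)])
targets = np.stack([encoded[i + 1:i + seq_len + 1]
                    for i in range(len(encoded) - seq_len)])

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(len(vocab), 16),
    tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.Dense(len(vocab)),
])
# Cross-entropy over next-character predictions; this is the loss we
# watch until it plateaus
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
history = model.fit(inputs, targets, epochs=2, verbose=0)
```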
Now for the fun part. Now that our model's been trained on our input text, we can tell the model to create as much data as we want. There are two ways we can do this. One, we can bootstrap the model: we can essentially prompt the model with some amount of text. For example, if we were looking at a biased data set and wanted to create more female records, we could start by telling the model that this record is female and the age is within the 30-to-50-year-old range, and have it complete the rest of the record based on what it learned from the input data. Alternatively, we can just let the model run by itself, in which case the model will, over time, create a new data set that has a very similar size and distribution to the original data.
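The generation loop itself can be sketched with a toy stand-in for the trained network; here a character bigram frequency model plays the role of the LSTM's "probability of the next character", purely so the autoregressive loop is runnable on its own:

```python
import random
from collections import defaultdict

def train_bigram(text):
    # Record which characters follow each character; this stands in for
    # the next-character distribution the neural network would learn.
    follows = defaultdict(list)
    for a, b in zip(text, text[1:]):
        follows[a].append(b)
    return follows

def generate(follows, seed, length, rng):
    # Autoregressive sampling: start from a prompt (the "bootstrap"
    # case), then repeatedly sample a next character given the last
    # one, letting the model run by itself.
    out = list(seed)
    for _ in range(length):
        out.append(rng.choice(follows[out[-1]]))
    return "".join(out)

follows = train_bigram("63,1,1\n37,1,0\n41,0,1\n")
sample = generate(follows, seed="6", length=12, rng=random.Random(0))
```

With the real LSTM, the same loop holds: the prompt ("this record is female, age 30 to 50...") constrains the start, and sampling completes the rest of the record.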
Here's a fun example of the bias reduction use case we mentioned earlier, working with the UCI data set. The original question was: can we increase the representation of female records in this data set to be on parity with the male representation, and what would that do for the overall accuracy of models built on this data set? We had very positive results. Here we see the increase going up, as I said, nearly 6% across the female population, and 2% in accuracy across the overall data set. So what did we do working with the UCI folks? We took the original data set from Kaggle, created a simple synthetic model, and told it to train and boost the representation, as you can see there very briefly, to build an equal representation of male and female attributes. Note that you can go as arbitrarily complex as you want here: if you wanted to bucket age range and ethnicity and gender, all of those things can be done. This is just a very simple example, boosting one minority class in the data.
After we created our synthetic data set, we took one of the top notebooks we see on the Kaggle platform and ran it against the data. Here we see six different classification techniques, anywhere from very simple techniques, such as a decision tree or Naive Bayes, to more complex ones: random forests, support vector machines, things like that. And we looked at the overall results. Here you see a pretty awesome increase, and actually some interesting insights into how a synthetic model works. What I see looking across is an increase in accuracy for every single model with the exception of one. It's worth diving in here; often it can be very difficult to explain why, for Naive Bayes for example, the overall accuracy of the model did not improve after using the rebalanced data set. A theory worth diving into is that Naive Bayes assumes that there is no co-dependence or correlation between the different variables we're training on.
From practical experience, it does seem that synthetic models are very good at learning deeply correlated things inside the data. For example, if you are over six feet tall, you might be more likely to weigh over 200 pounds. The synthetic model may have replayed correlations like this, which aided the other algorithms in their decision process, leading to better results than on the original data set. Naive Bayes, which doesn't take correlation into account, may not have been able to take advantage of it. Earlier we mentioned the possibility of something more complex than addressing a single imbalance. Another example we've got right here, which you can run through the source code for as well, is balancing race, income bracket, and gender to create a better version of a U.S. census data set to use for ML prediction.
A final question we often get is: were I, as an organization, to take my super sensitive data set and create a completely artificial synthetic version of it, what is the loss, or hit to utility or accuracy, that I might see? We saw in the previous slide that when you are able to use synthetic data to engineer your data, to create a less biased or more fair data set to work with, there's the potential for even increasing accuracy on the different machine learning tasks you want to apply. In this case, what we're doing is a very simple substitution, where in purple you see the original data, and in blue you see a completely synthetic version of that data set, completely artificial, using none of the original records.
We took the top six or seven data sets that we saw on the Kaggle data set platform, and we applied the different classification tasks these were intended to be used for. Does a user have heart disease? Was this person a successful hire or not? Things like that. Here you can see the results from running essentially a default configuration of Gretel synthetic data on a machine learning task against the original data. Often the results are very similar. In some cases you do see a hit: in this HR attrition prediction we see a nearly 6% drop in accuracy. But for many cases, where you're concerned about, for example, individual employee records being memorized by a model and becoming identifiable, becoming a security or compliance risk for your organization, it's an acceptable trade-off to work with, and it's one of the things we're always working to improve. That's it for today. Thank you for your time, and don't hesitate to reach out if you have any questions.
If you have questions, join our Slack community to continue the conversation.