Synthesize 2023: How Foundation Models can help unlock Multi-Modal Synthetic Data
Video description
A glimpse at the future of multi-modal synthetic data generation
Transcription
Ali Golshan (01:13):
Next, I'd like to welcome my co-founder, Alex Watson, our chief product officer. He runs all of our applied science as well as product management and design. Alex is going to talk a little bit more in depth about synthetic data, large language models, and, really importantly, what we're actually seeing and how we're making a lot of our decisions around the product roadmap. I'm really excited to hear this talk from Alex, so here comes Alex.
Alex Watson (01:36):
Hey, I'm Alex Watson. I'm a co-founder and chief product officer at Gretel.ai. Today we're going to talk about foundation models and how they can be used to unlock multi-modal synthetic data. Let's start from the top and talk about generative AI and synthetic data and what that means for your organization.
(01:51)
Whether you're training a model from scratch or you're fine-tuning a foundation model on your organization's data, we'd like to think of generative AI models as an activation of your data. What does that mean? Generative AI models can be queried to learn insights. They can be updated in real time, and they can be used to generate synthetic data to power a multitude of use cases, from internal data sharing, to data exchanges, to creating training sets for machine learning.
(02:16)
Three years ago, we launched the first synthetic data model for tabular data on GitHub. It's based on an LSTM recurrent neural network and featured the ability to work with tabular, text, and time series data in any language format, making it a very powerful place to start.
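To make the idea concrete, here is a minimal sketch of how an LSTM-based synthetic data model works at its core: a character-level recurrent network learns to emit records one token at a time. This is an illustrative PyTorch example with assumed hyperparameters and a toy dataset, not the actual open-source implementation being described.

```python
# Illustrative sketch: a tiny character-level LSTM that learns to emit
# CSV-style records one character at a time. Hyperparameters and the toy
# records are assumptions, not the actual open-source model.
import torch
import torch.nn as nn

records = ["age,bp,outcome\n", "63,140,1\n", "45,120,0\n", "58,135,1\n"]
text = "".join(records)
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}

class CharLSTM(nn.Module):
    def __init__(self, vocab_size, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 32)
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x, state=None):
        out, state = self.lstm(self.embed(x), state)
        return self.head(out), state

model = CharLSTM(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
ids = torch.tensor([stoi[c] for c in text]).unsqueeze(0)

# Train the network to predict the next character of each record; sampling
# from the trained head one character at a time then yields new synthetic rows.
for _ in range(200):
    logits, _ = model(ids[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, len(vocab)), ids[:, 1:].reshape(-1)
    )
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```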
(02:32)
Here we see a use case where a hospital was seeking to train a machine learning model on patient data to better predict how long a patient would stay within the emergency department. Due to the sensitivity of this data, just giving developers access to the data to train the model was difficult, much less thinking about how it would be deployed and the patient privacy risks that would be associated with it. What the hospital found was that they could create a synthetic model and synthetic dataset with entirely artificial records that produced real-world accuracy when used to train a machine learning model, with none of the risks around patient privacy.
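A common way to check that kind of claim is a train-on-synthetic, test-on-real comparison: train one model on real records and one on synthetic records, then evaluate both on the same held-out real patients. Here is a hedged sketch with scikit-learn; the dataframes and the `length_of_stay` target column are placeholders, not the hospital's actual schema.

```python
# Sketch: train-on-synthetic / test-on-real (TSTR) comparison.
# `real_df` and `synthetic_df` are placeholders; column names are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def tstr_score(real_df: pd.DataFrame, synthetic_df: pd.DataFrame,
               target: str = "length_of_stay") -> dict:
    train_real, test_real = train_test_split(real_df, test_size=0.2, random_state=0)

    # One model trained on real records, one trained only on synthetic records,
    # both evaluated on the same held-out real patients.
    scores = {}
    for name, train_df in [("real", train_real), ("synthetic", synthetic_df)]:
        model = GradientBoostingRegressor().fit(
            train_df.drop(columns=[target]), train_df[target]
        )
        preds = model.predict(test_real.drop(columns=[target]))
        scores[name] = mean_absolute_error(test_real[target], preds)
    return scores  # comparable errors suggest the synthetic data preserved utility
```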
(03:08)
Users quickly realized that generative AI models could be used for much more than anonymization or creating new datasets that match the distribution of the original data. What if you could make the data better? In the example in front of us, university researchers at UCI took a popular heart disease detection dataset that had a flaw. As you can see, there was a massive imbalance between the number of female and male patients in the data. When you train a machine learning model on top of that data, the model will be excellent, in this case, at detecting heart disease across male patients, but very poor at detecting it across female patients.
(03:42)
The thesis didn't end up working out for the UCI researchers. We [inaudible 00:03:46] a very common use case for generative AI today: boosting the representation of female patients in the dataset to an equitable distribution that we then used to train the machine learning model. This resulted in a 2% overall accuracy improvement across the dataset and a 6% improvement in the ability to detect heart disease in female patients.
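A rough sketch of that balancing step: generate extra records conditioned on the underrepresented group until both sexes are evenly represented, then train on the augmented set. The `generate_records` callable below stands in for whatever conditional generative model is used; it is not a specific library API, and the column name loosely mirrors the UCI heart disease dataset.

```python
# Sketch: boost the underrepresented group to an equitable distribution.
# `generate_records(n, conditions=...)` is a placeholder for a conditional
# generative model, not a real library call.
import pandas as pd

def balance_by_sex(df: pd.DataFrame, generate_records) -> pd.DataFrame:
    counts = df["sex"].value_counts()
    minority = counts.idxmin()
    deficit = counts.max() - counts.min()

    # Ask the generative model for synthetic records conditioned on the
    # minority group, then train downstream models on the combined set.
    synthetic = generate_records(n=deficit, conditions={"sex": minority})
    return pd.concat([df, synthetic], ignore_index=True)
```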
(04:05)
Three years later, we have learned that the use cases around generative AI are far more varied than we ever would've thought. There is no such thing as one model to rule them all. Sometimes it makes sense to fine-tune a foundation model. Sometimes it makes sense to train a model from scratch. Today we offer over six different models, whether language, time series, image, or text oriented, that users can build with to create models on top of their data.
(04:31)
Synthetic data and generative AI have opened up a new set of use cases for organizations, one of the most powerful being the creation of synthetic data exchanges. These exchanges are often based on data that was too difficult to share, whether for privacy, regulatory, or compliance reasons. On the left, many of the advances in the medical and life sciences research space have been powered by the ability to share data. Unfortunately, that is still a very slow process, often taking six to 12 months to get access to the datasets that researchers need to test an idea or a thesis.
(05:05)
Later today, you'll hear from folks at Illumina about how they're able to synthesize genomic data and recreate the results of popular research. This opens up the door for entirely artificial data sets that aren't based on any real person to be shared across the medical community to fuel research at a global level. Another interesting piece there is that it opens up a new business model for hospitals that have yet to find a way to successfully monetize or build upon this data.
(05:33)
Another extremely promising area for data exchanges is the ad space, or the ad recommendation space. In addition to companies that are working with ads on a regular basis and looking for more private ways to do so, synthetic data opens up the opportunity for companies that have never monetized their data due to privacy concerns to share information that is incredibly valuable to ad recommendation systems, without sharing any actual information about their customers.
(05:59)
On the far right, we see examples from the FinTech space. There are many use cases the FinTech space is going after, but one of particular interest is around the prediction of rare events and market changes. Many of us are very familiar with GameStop and the crypto market crash, as well as major shifts and drifts in how people interact with data. Another really promising space for generative AI is around working with precise location data. In the examples in front of us, we see scooter locations, which are often shared with municipalities to aid with city planning and to understand traffic patterns. This data can be incredibly private, and here we were actually working with a scooter company that wanted to run a test to see if it was possible to train a generative model to create a new dataset of artificial source and destination rides that could be shared with the municipality and would have the exact same insights and distributions as the original data.
(06:56)
What we found was that not only were we able to succeed in this use case, but that same model, trained on a couple of cities in the southwestern United States, could be applied globally to predict where users might be located. On the far right, we see an example from Tokyo, about as far away as possible from the southwestern United States. What we see is an application of transfer learning, where the model was able to learn about things like intersections, bridges, or city corners, and was able to predict with a high degree of certainty where users might be located in a city it had never been trained on.
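One simple way to sanity-check whether synthetic rides really carry "the exact same insights and distributions" is to compare the marginal distributions of trip coordinates and distances between the real and synthetic datasets. Here is a hedged sketch; the column names are assumptions about how such ride data might be laid out.

```python
# Sketch: compare real vs. synthetic ride distributions.
# Column names (start_lat, start_lon, end_lat, end_lon) are assumptions.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def compare_rides(real: pd.DataFrame, synthetic: pd.DataFrame) -> dict:
    def trip_km(df):
        # Rough equirectangular distance; fine for a same-city sanity check.
        dlat = np.radians(df["end_lat"] - df["start_lat"])
        dlon = np.radians(df["end_lon"] - df["start_lon"]) * np.cos(
            np.radians(df["start_lat"])
        )
        return 6371 * np.sqrt(dlat**2 + dlon**2)

    results = {}
    for col in ["start_lat", "start_lon", "end_lat", "end_lon"]:
        results[col] = ks_2samp(real[col], synthetic[col]).statistic
    results["trip_km"] = ks_2samp(trip_km(real), trip_km(synthetic)).statistic
    return results  # smaller KS statistics mean closer distributions
```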
(07:29)
Now for the exciting stuff: where do we see things going? Automation is a major use case. When we think about generative AI, many of us have experimented with ChatGPT. Many of us have experimented with Stable Diffusion. One minute you'll get an incredible answer to a question or to a prompt that you provide. The next minute, you'll get something that makes no sense at all. When organizations think about deploying generative AI for their own use cases, having extreme confidence that the model is going to generate things that match their expectations, or the prompt that they provided the model, is very important.
(08:01)
On the left, we see examples of machine learning pipelines being trained on data created by generative AI. It's not one or two images, it's a thousand, so it's important that we have confidence in this type of data as it's generated.
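One way to build that confidence at scale is to validate every generated record against explicit constraints before it reaches a training pipeline, keeping only the records that pass. Here is a minimal sketch; the schema and constraints are invented for illustration.

```python
# Sketch: keep only generated records that satisfy explicit constraints.
# The schema (age, amount, status) and the rules are hypothetical.
import pandas as pd

CONSTRAINTS = {
    "age": lambda s: s.between(0, 120),
    "amount": lambda s: s > 0,
    "status": lambda s: s.isin(["open", "closed", "pending"]),
}

def validate_generated(df: pd.DataFrame) -> pd.DataFrame:
    mask = pd.Series(True, index=df.index)
    for column, rule in CONSTRAINTS.items():
        mask &= rule(df[column])
    print(f"rejected {(~mask).sum()} of {len(df)} generated records")
    return df[mask]  # only validated records flow into downstream training
```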
(08:15)
Diving into some of the exciting use cases: here on the left, we look at an image recognition pipeline being trained by a company that is processing receipts. These receipts can be simulated with a variety of lighting conditions, tears, and even different types of text, another application of multi-modal synthetic data.
(08:34)
In the middle, we see an example where an insurance company is training machine learning models to recognize and to predict the cost of repairs for certain types of windshield cracks. For a pipeline like this to be successful, you need to be able to create multiple images with a high confidence that you have a picture of a car with a window matching certain patterns of window breakage, whether it's a radial pattern or a consistent splash pattern across the window.
(09:00)
Another application here of the same types of generative AI we see with Stable Diffusion, but with a focus on high confidence in the data and validation of the data that is created by the model. On the right, we see an example of a developer training a chatbot to recognize new examples of customers calling in and asking about late card arrivals for a banking card.
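Images like these can be generated in bulk with an off-the-shelf diffusion model, for example through Hugging Face's diffusers library. The checkpoint and prompts below are illustrative choices, not the companies' actual pipelines, and a validation or human-review step like the one sketched above would still sit between generation and training.

```python
# Sketch: bulk-generate training images with Stable Diffusion via diffusers.
# The checkpoint and prompts are illustrative; requires a CUDA GPU.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompts = [
    "photo of a crumpled paper receipt on a table, dim lighting",
    "car windshield with a radial crack pattern, close-up photo",
]

for i, prompt in enumerate(prompts):
    for sample in range(3):  # several variations per prompt
        image = pipe(prompt).images[0]
        image.save(f"synthetic_{i}_{sample}.png")
```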
(09:22)
In addition to all of the interest around creating data exchanges and new monetization capabilities, we see tons of interest in organizations building their own model hubs. These are used to facilitate data sharing within the business. Unlike a data warehouse or a data lake, it doesn't require any copying of the data. You simply train models. These models can automatically apply privacy protections, and they're accessible via a simple API with a natural language or a simple structured interface, making them very accessible to people across your business.
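To illustrate what querying such a model hub could look like from a developer's point of view, here is a hypothetical sketch. The `ModelHubClient` class, its endpoints, and its parameters are invented for illustration and do not correspond to any specific product API.

```python
# Hypothetical sketch of a model-hub client; the class, endpoints, and
# parameters are invented for illustration, not a real product API.
import requests

class ModelHubClient:
    def __init__(self, base_url: str, api_key: str):
        self.base_url = base_url
        self.headers = {"Authorization": f"Bearer {api_key}"}

    def query(self, model_name: str, prompt: str, num_records: int = 100):
        # Ask a trained model for synthetic records instead of copying raw data.
        resp = requests.post(
            f"{self.base_url}/models/{model_name}/generate",
            json={"prompt": prompt, "num_records": num_records},
            headers=self.headers,
        )
        resp.raise_for_status()
        return resp.json()["records"]

# Hypothetical usage:
# records = ModelHubClient("https://hub.example.internal", "API_KEY").query(
#     "patients-v2", "emergency department visits from 2022", num_records=500)
```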
(09:52)
We see a future where developers will prefer working with synthetic data over its real-world equivalent. Why? It's cleaner, it's more private, it's easier to work with, and it can be customized to the exact use case that you have. Thank you so much for your time today. We'll look forward to any questions in Discord.