Developing Multi-Modal Generative AI Programs for Enterprises

Video description

Apr 20, 2021

Gretel CTO and cofounder John Myers explores building a generative AI program geared toward multi-business-unit support, with a focus on synthetic data and the theme of "building for posterity."

Good afternoon everyone. Glad to be here. My name is John Myers and I am the CTO at Gretel.ai, and we are a generative AI platform that specializes in creating synthetic data. All right, so what I hope everyone here can take away today is a way to aggregate a lot of the awesome talks we had this morning, which covered a lot of different use cases for generative AI, and then to think about how to actually build a generative AI program within the enterprise. What I want to do today is talk about the different modalities for generative AI, that is, the different types of data you can actually operate on. I'd like to walk you through some example workflows that we see with our customers today, and then talk about some common vendor deployment models, so you can think about the different characteristics of tools and products that you'll want to deploy inside of the enterprise as you incorporate synthetic data into the business.

So here's what we're going to cover: we'll give some definitions of synthetic data and generative AI, and talk about the different modalities that exist within that domain. We'll walk through a couple of machine learning and synthetic data example workflows; these are examples that we pulled from our customers on how they integrated synthetic data into their ML ops processes. We'll talk about some deployment considerations, and then we'll go into Q&A. So let's start with some high-level definitions. Synthetic data is generated data that mirrors the statistical properties of real-world data, and it enables building on top of that data, simulating real-world situations, without having access to the original data. Synthetic data can be generated through generative AI methods across a variety of modalities. These modalities let you process lots of different types of data: tabular data, multi-table and relational databases, natural language, and time series data.

And we do this through a variety of different types of models, which are all deployable through Gretel, either in our cloud or in what we call a hybrid, cloud-prem model inside of your enterprise. The benefit is that you have flexibility in where you actually deploy. If you need to get started very quickly and experiment, you can jump right into Gretel Cloud. Once you want to deploy at scale, you can pull things into your own environment and build your own internal corpus of models that are yours and yours only, so you're not sending any of your sensitive data to any cloud provider.

So let's talk about modalities of synthetic data. These are the core modalities that we support at Gretel, and you'll find that there are a lot of different tools out there that support these modalities in a variety of ways. The most common one we have is tabular data, which comes in a couple of different flavors. The first is single table, and typically the use case we see with single-table tabular data is taking machine learning training data sets and creating a synthetic version of them, so they're safe to share more broadly across the enterprise. It's really easy to get started with that. If you already have these types of data sets curated, you can come right into a Gretel product, create a synthetic version, and then start sharing and using it across the enterprise without worrying about those privacy concerns.

An example we have here, from one of our actual customers today, is making use of opt-out data. Typically what we found with a lot of our customers is that in order to make data shareable, they had to apply different types of transformation techniques to the data. Maybe you're bucketing certain values, or you're watering the data down a little bit, so you don't have a lot of fidelity left in it. The way we solve that is by allowing them to create a synthetic version of that data, so you don't have to use that overly bucketed data, and you have something that can be more broadly shared for your classification and regression use cases. Stepping further down the use case line, in more advanced situations, we have the ability to let folks augment their data sets with new records, and this allows you to either increase the total number of training samples or create training samples of a certain class within the dataset.
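To make the bucketing trade-off concrete, here's a minimal sketch in plain Python (not a Gretel API; the ages and bucket scheme are invented for illustration) showing how bucketing a sensitive numeric column collapses fidelity that a synthetic version of the data would preserve:

```python
# Hypothetical example: bucketing a sensitive numeric column for sharing.
# Bucketing makes the data safer, but many distinct values collapse into
# a handful of coarse labels, which "waters down" the fidelity.

def bucket_age(age: int) -> str:
    """Collapse an exact age into a coarse ten-year range label."""
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"

ages = [23, 27, 31, 34, 35, 38, 41, 44, 47, 52]
buckets = [bucket_age(a) for a in ages]

print(sorted(set(ages)))     # 10 distinct exact values
print(sorted(set(buckets)))  # only 4 coarse labels survive
```

A synthetic version of the column would instead emit new exact-looking values drawn from the learned distribution, keeping the fidelity without exposing the original records.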

When it comes to moving into your ML ops platforms, a lot of them have different requirements: you have to have a minimum number of training records, and they also suggest that you have a certain balance within your data. If that doesn't exist in your original training data, you can use synthetic data to fill in those gaps, so to speak.

The next most common modality with our customers is time series data. This is data that is linear across time, and the most common use case here is filling in gaps in data that is generally pretty expensive to collect. For instance, one of our customers has physical sensors out there collecting information. They're expensive to deploy, so they might only have a handful of them, collecting data points within a certain range.

And so in order to fulfill their simulation needs, they're using synthetic data to fill in some of the gaps in the time series. The next modality is natural language; there was a lot of that in the talks earlier today. Really this comes down to being able to build your own internal GPT models. Earlier we had the talk from the Federal Reserve, and one of the questions that was asked was, hey, are you using pre-trained, open source models that are out there to build your own internal LLM? With Gretel, what we allow you to do is pull some of those models from the ecosystem, bring them into your environment, fine-tune them on your own data, and then use those GPT models to fulfill a bunch of different use cases.

Some of the most common use cases we have with our customers today: creating additional training samples for NLP modeling, and more concretely, creating additional samples for toxic language detection and spam and fraud detection, and training and developing helper bots like chatbots and virtual personal assistants. And then our last modality, available on Gretel today through a free public preview, is synthetic images. This allows us to build on top of tools like Stable Diffusion: you can show up, bring prompts, and generate images for ML training that are generally pretty expensive to acquire one way or another. Think about insurance and medical tech. We have use cases where insurance companies want to create images of really catastrophic accidents, so they can be automatically analyzed for things like fault. Since those are really hard to acquire (I don't think anyone wants to go out and intentionally crash a bunch of cars), we can generate those types of images, and you can bring them right back into your ML ops pipeline.

Now we're going to get into some example workflows for integrating synthetic data into your ML ops process. There are two use cases that I'll walk through. The first: we already have ML training sets, but they don't perform well because of imbalances in the data or because there aren't enough training samples. The second: we want to introduce ML into our environment, but we don't have any training sets yet and just need to get started. How do we explore our production data to get to the point where we can create high-fidelity ML training sets?

Alright, so here's a high-level walkthrough of what we see inside a couple of our customers that are already building ML/AI models but have a challenge with imbalances in the data. We'll start with the ML/AI data sets. Usually the teams are already analyzing those data sets for imbalances. Once they're able to identify what the gaps are in that data, you can take the ML/AI dataset that you already have and train a synthetic model on it. In this case, we're training a tabular data model, and we have a whole variety of models that you can use for that depending on the need: language-based, GAN-based, statistics-based. We have models that focus on differential privacy. So we have four or five different models you can actually choose from.
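The imbalance-analysis step can be sketched in a few lines of plain Python (the labels and the balancing policy are invented for illustration; a real ML ops platform would have its own tooling): count the classes, then compute how many synthetic records per under-represented class you'd want to request.

```python
from collections import Counter

# Toy training labels with a heavy class imbalance.
labels = ["approved"] * 900 + ["denied"] * 100

counts = Counter(labels)
target = max(counts.values())  # balance every class up to the majority class

# How many synthetic records to request per under-represented class.
deficit = {cls: target - n for cls, n in counts.items() if n < target}
print(deficit)  # {'denied': 800}
```

The `deficit` numbers are what you'd then feed into the conditional-generation step as record counts per class.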

Once you train that synthetic model, you can start generating additional data using what we call conditioning. Let's say you are collecting data from a bunch of different locations and you don't have a lot of training samples from the city of New York, for example. What you can do, once you train that model, is say: hey, give me 500 more records, but make sure the location is New York. The model then fills in the rest of each record for you based on what it has already learned about the records derived from New York. Once that happens, you can add that data back into your training set, continue with your ML/AI experimentation, and see how your classification or regression models are performing. Then you can keep iterating and tuning to make sure you're building an ML/AI dataset that is balanced in the way that you need it.
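The conditioning idea can be illustrated with a deliberately tiny toy model in plain Python (this is not Gretel's actual API; the field names, data, and the simple Gaussian "model" are all invented for illustration): we "train" per-location statistics, then generate records where the location field is fixed and the model fills in the rest.

```python
import random
import statistics

random.seed(0)  # deterministic for the example

# Toy "training data": records keyed by location.
training = [
    {"location": "New York", "amount": a} for a in [120, 135, 128, 142, 131]
] + [
    {"location": "Boston", "amount": a} for a in [60, 72, 68, 65, 70]
]

def fit(records):
    """'Train': learn a per-location (mean, stdev) for the amount field."""
    by_loc = {}
    for r in records:
        by_loc.setdefault(r["location"], []).append(r["amount"])
    return {loc: (statistics.mean(v), statistics.stdev(v)) for loc, v in by_loc.items()}

def generate(model, location, n):
    """'Conditioning': fix the location, let the model fill in the rest."""
    mu, sigma = model[location]
    return [
        {"location": location, "amount": round(random.gauss(mu, sigma), 1)}
        for _ in range(n)
    ]

model = fit(training)
synthetic = generate(model, "New York", 500)  # "give me 500 more New York records"
print(len(synthetic), synthetic[0]["location"])
```

A real tabular synthetic model learns the joint distribution across all columns, so conditioning on one field shapes every other field it fills in, not just a single numeric column as in this sketch.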

Now let's walk through an example where you might have an enterprise that's just getting started with an ML/AI program for a specific set of use cases. In this situation, what we commonly see is that you have applications out there, and all your data is residing in production databases. It's highly normalized, spread out across data warehouses or relational databases that are designed for transactions, right? Because you're building an application that users are interacting with, you're building this huge corpus of data, but you might not have any ML/AI training sets yet. So what do you do? Typically, you would explore all that data and then start crafting the queries you need to build those training sets.

The problem here is privacy: how do you make a safe version of that production database that you can actually comb through and analyze? What we allow you to do is take that production database and create a subset of it, right? Because at this point you're just doing data exploration. Once you can create a subset of that database, you can start exploring it and build out the different query patterns you need. For example, you might finally develop your data set, save that as a materialized view, and push that back to the production database. Now you can harvest the data from the production database with that materialized view, which gives you the single table that you'll want to train on in your ML ops platform. Once you have that single table, you can synthesize it into a new data set and start using that in your ML ops process. This is really common for companies that are trying to break ML/AI into new areas and are completely blocked because they don't have access to any of the data to begin with.
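Here's a minimal sketch of that flattening step using SQLite via stdlib Python (the schema, table names, and data are invented for illustration; SQLite only has plain views, whereas a warehouse would let you materialize the query): join the normalized production tables into a view, then harvest the single flat table you'd later synthesize.

```python
import sqlite3

# Toy "production" database with normalized, transaction-oriented tables.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         amount REAL);
    INSERT INTO customers VALUES (1, 'New York'), (2, 'Boston');
    INSERT INTO orders VALUES (10, 1, 120.0), (11, 1, 135.0), (12, 2, 60.0);

    -- The "materialized view" step: flatten the normalized tables into
    -- the single table you want to train on.
    CREATE VIEW training_flat AS
        SELECT c.city, o.amount
        FROM orders o JOIN customers c ON c.id = o.customer_id;
""")

# Harvest the flat table you'd hand to a synthetic model.
rows = db.execute("SELECT city, amount FROM training_flat ORDER BY amount").fetchall()
print(rows)  # [('Boston', 60.0), ('New York', 120.0), ('New York', 135.0)]
```

In the workflow described above, exploration happens against a safe subset rather than production, and only the final flattened query gets pushed back as a view to harvest from.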

All right. Now we'll get into some deployment considerations. There are three different varieties here, and Gretel actually offers all three. You'll find that as you explore different tools, some are in one bucket and some are in multiple buckets. We'll start with open source or semi-permissive licensing. What I mean by semi-permissive licensing is that there are different packages out there available to use, but you have to read the fine print. A lot of licenses are moving to source-available or restrictive licenses, such that you might be able to use them in the enterprise, but only for experimentation, or only internally. You can't turn around, build a synthetic data service with open source tools, and resell it. So the first thing, when you're starting to pick up an open source package, is to read the fine print on the license and understand exactly what your restrictions are.

At Gretel, our open source packages are under what we call a source-available license. If you want to explore Gretel's core Python packages and our free variants, they are free for any enterprise to build anything with. Our only restriction is that you can't build a competing service and resell that service to others. Now, with most open source tools, some of the drawbacks are that you're mostly relying on community support, and the licensing could change at any moment. Most of these also come in the form of SDKs, mostly in Python, just because that is where the core packages come from: most of these SDKs, much like Gretel's, are built on top of underlying libraries like TensorFlow and PyTorch. That does mean it requires a little bit of experience from your developers in operating with those packages.

And then the last part of thinking about using open source tools in your environment is scaling, deployment, and infrastructure. It's on you, right? You have to figure out how to deploy those packages on the right hardware, how to scale it, how to resource it. If you're resourced to do that, that's great. I would recommend that if you're going to look at those types of tools, you use them in a sandbox environment, but when it comes down to actually operationalizing, you might want to consider moving into more of a vendor cloud hosted model, or a cloud-prem or hybrid mode. Vendor cloud hosted, like a full SaaS offering, is great: it's easy to get started and experiment, assuming free tiers exist. Gretel has a free tier; you can get in, and we have demo data for you, so you don't even have to worry about uploading your sensitive information. And then I'll go right to the hybrid mode, which is where you can deploy the compute resources in your own cloud environment. At Gretel, we support deploying hybrid mode into AWS, GCP, and Azure. Once you're there, our data plane operates inside the confines of your virtual cloud: your data never leaves your environment, the models you train stay in your environment, and you can fine-tune those models and use them pretty much in an unlimited fashion.

Wrapping up, what I covered was the different things you need to consider when building a generative AI program. Step one: think about what modalities you need. Are you dealing with tabular data, relational data, image data, time series data? Make sure you're looking for a vendor that gives you support for everything you need, so you're not having to build stovepipes for each modality. Second: where in your existing workflows do you want to plug in? Make sure you have the tools to plug into your existing ML ops workflows and that it fits the processes you already have. And finally, deployment considerations: am I doing open source? Am I going full SaaS hosted? Or do I need something in between that gives me the ability to keep the data in my environment with a light footprint of compute resources in a cloud that I already use? And that should wrap it up.