Synthesize 2023: How LLMs Can Generate Value for Organizations
Video description
Exploring AI value creation in the enterprise and how LLMs can help.
Transcription
Alex Watson (00:00):
Hey, I'm Alex Watson. I'm the co-founder at Gretel.ai. I'm super excited to be here at Synthesize '23. Joining me today is Aditya Bindal from AWS. Aditya, do you want to do a quick introduction?
Aditya Bindal (00:13):
Sure. Thanks, Alex. Really excited to be here at the first Synthesize. My name is Aditya, and I work on Deep Learning at AWS. A lot of the focus these days is on large-scale training, inference, compilers, and how all of these different technologies are enabling generative AI. So super excited to be here, thanks for having me.
Alex Watson (00:35):
Most definitely. Today we're going to cover a series of topics. The overall subject is really how LLMs can generate value for organizations. Within that, we're going to go anywhere from AI value creation to diving in on LLMs, and things that we see happening in the developer and machine learning space around LLMs.
(00:53)
And then also how to make generative models work at scale. So how do you make them work for your organization, and how do you use them to unlock your data? So super excited, and coming from the unique lens and perspective that you have at AWS, I'm really excited to hear some of your opinions on these. Let's go ahead and dive in.
(01:12)
Maybe at just the highest level, the question that we hear all the time from users on our platform and in our community is: given all this incredible interest around generative AI, with ChatGPT, Stable Diffusion, DALL-E, and all these cool things that we've all played with and started to use, where do you see the value for organizations being unlocked, maybe over the next year?
Aditya Bindal (01:37):
It's a great question. I think we've seen a lot of interesting proofs of concept and experiments, where people have been able to do just amazing things with text-to-image models and text generation models. And we're starting to see some applications among enterprise customers, where they take these building blocks and try to build some end-to-end applications around them, but I think it's not clear yet where there might be a moat. So to summarize, I think there's no obvious moat at any point in that stack as yet.
(02:14)
Initially, I think people believed it would be just at the model level. So whoever had the best model would have the moat, and they would be so far ahead that no one else would be able to catch up. But I think the model space has become extremely competitive. So you now have multiple providers with proprietary models, you have multiple providers with open source models, and you have some enterprise customers that are saying, "I have enough data, I have the expertise in-house, I'm going to go and pre-train my own version of these models on proprietary data."
(02:46)
So I think we've seen that at the model level there has actually been much more diversity than anyone expected maybe three, four years ago. Going a little bit above that, on the application layer, we're seeing some early wins, like the extremely popular Lensa avatar app, which was on the App Store, I think at number one, for several weeks. And we've seen Jasper and other kinds of applications that have been built on top of these foundation models. But even there, I think it's unclear what the end state is going to be.
(03:19)
So for now, what we see from enterprises is that they're very interested in proving out value quickly, and then they want to build the end-to-end application, and there seem to be many different options for them, open and closed. So no obvious moat where we can say, "This is where all the value is going to be in generative AI." And that's a good thing in some ways, because it means that more people will get exposed to it, more people will be using it, there's going to be more competition and innovation, people trying to compare different types of modalities, different types of models. That's what we're seeing on our side. I'd be curious what you might have seen on the Gretel side with your customers, and all the work you guys are doing in this space.
Alex Watson (04:09):
Yeah, it's a really interesting and quickly evolving area. One of the things that I think is the next step, at least from what we're hearing from different businesses, is: given these incredible building blocks, like a GPT-3, or a language model that can learn to do a variety of tasks inside of your business, how do you take that value and apply it to the unique, domain-specific knowledge that you have for your business? And that last 20%, so 80% the language model and 20% working with your organization's data, creates potentially immense value.
(04:52)
Something that could recommend the next logical action for a patient that came in to see the doctor, and things like that. So you have the ability to reduce insurance costs, you have the ability to give better care to people. So one of the questions we see, and one of the things I think is so interesting as these LLMs are evolving, is how do organizations put them to work? How do they use them with their really specific and incredibly valuable data, and how do they make that available both to their business and to their customers? It's an exciting area.
Aditya Bindal (05:25):
Yeah, absolutely. And one of the themes we've seen is that large enterprises have become very interested in using text as this universal interface. So one of the interesting things about these foundation models is that they remove the need for a lot of expert-level fine-tuning of smaller models. Previously, the way you'd interact with a model might be: I need to create this dataset, label it, then send it to an existing model like BERT, and then I get some output and deploy that into production.
(05:59)
And the zero-shot capabilities, which, as you know, refer to the model's ability to generate accurate results without seeing new examples, are something that widens the accessibility of these models, because anyone can basically go and give text prompts in natural language, the way you and I are talking right now, and get some interesting results back. So it just widens it out. And now we're seeing not just the machine learning teams at large enterprises interested, but also the HR team and the legal team and marketing, and it's just democratized it very quickly.
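To make that contrast concrete, here is a minimal sketch of what a zero-shot interaction looks like compared to the older label-and-fine-tune workflow. The `complete()` helper is a placeholder for whatever hosted or open source LLM endpoint you call, not a specific product API.

```python
# Minimal sketch: zero-shot classification with a plain-language prompt.
# `complete(prompt)` is a placeholder for any LLM completion call
# (hosted API or open source model); it is not a specific library API.

def complete(prompt: str) -> str:
    """Placeholder for an LLM completion endpoint."""
    raise NotImplementedError("wire this to your model of choice")

def zero_shot_sentiment(message: str) -> str:
    # No labeled dataset, no fine-tuning: the task is described in natural language.
    prompt = (
        "Classify the sentiment of this customer message as "
        "positive, negative, or neutral.\n\n"
        f"Message: {message}\n"
        "Sentiment:"
    )
    return complete(prompt).strip().lower()

# The older workflow would instead require collecting and labeling examples,
# fine-tuning a model like BERT on them, and deploying that model.
```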
Alex Watson (06:39):
That's one of the areas where I think the applicability of LLMs is just beginning to be understood: creating interoperability between data inside of a business. So whether you're looking at customer-facing chat logs that you might have, or you could be looking at metrics from your product, or even across databases.
(06:57)
I think that's one of the promising things that we're starting to see: it's no longer a massively complicated SQL join that you have to do to query a couple of datasets and get insights across your business. LLMs, when properly applied, have the ability to create that interoperability layer, and then a very natural kind of text interface, as you were mentioning, to just ask it questions.
Aditya Bindal (07:19):
Yeah, and I think that's the big opportunity that we haven't even begun to explore: how these LLMs will fundamentally change the way companies operate, share data, where they draw boundaries, where tribal knowledge lies and gets unlocked. All of these things, I feel, can fundamentally change because of LLMs.
Alex Watson (07:41):
Yeah, I think that's awesome. And in conversations with customers, we've actually been asking them, "How do you see this data being used?" One of our customers brought up a really neat way to think about it. They were talking about the generative models trained on their domain-specific data as being the activated form of their data, in the sense that you could query it however you wanted to, you could use it to create new datasets to train machine learning models, you could send it to your business intelligence tools. So it creates a really efficient, compressed version of data for your entire organization. It's really exciting.
Aditya Bindal (08:16):
I like that framing a lot, and it resonates with something one of the large enterprise customers said recently when they were thinking about these generative AI models: before, we used to think about machine learning projects where we take our data and bring it to the model, bring it to the compute. And now the way we think about it is, our data and our logs are, let's say, on AWS in S3 buckets and data lakes, and we want to bring these generative AI models to the data.
(08:48)
So the data and all the workflows that generate that data stay where they are, and then we bring the models to the data. And I think that resonates really well with what you were saying about activating all of that data.
Alex Watson (09:01):
That's awesome. Going back to one of the topics that you brought up earlier, the closed versus open approaches with foundation models. Rewinding the past five years, we've gone from a totally, at times, open approach with open datasets and open code, to open algorithms but not open data, to a lot of things being behind APIs right now, and everything in between. Any thoughts or comments that you would have on that, and what that means for the industry?
Aditya Bindal (09:29):
I think it's a rapidly evolving landscape and question, because as you said, we went from having almost entirely open models, to having lots of closed proprietary models, and now we have open models again. And I think we have customers and partners that are doing both. So we have partnerships with customers like Stability AI, that have the open source Stable Diffusion model that they contribute to. We also have some of the broader startups in the Stability AI community and ecosystem that are working on language models as well.
(10:10)
We work very closely with Hugging Face, and they obviously have this huge repository, the GitHub of machine learning models, as they describe it. But then we also have partners like Cohere and AI21 that have their own proprietary models. So I think we expect that both will coexist. I think by now we would've seen if there was a strictly open source solution here that would win.
(10:39)
I think we should expect that there will be proprietary closed models that will likely have small advantages in accuracy, and potentially in performance, because they might be ahead by six to eight months, or even a year, to market, and they might have access to some proprietary data. And in many ways the model is not really the moat; the data that they use might be the moat. And if you look at the accuracy benchmarks, I think these are intensely competitive.
(11:11)
So in many ways the open and closed question might be trumped, to some extent, by how you can improve on the accuracy and the instructability of these LLMs. One thing that I think is fairly clear to us now is that we won't see only closed models. So if you look at the two extremes, we see closed and open coexisting: we don't expect only open models, and on the other extreme, we don't expect entirely closed models, because there's such a strong incentive, such strong demand, and so many active communities on Discord that want to create open source alternatives.
(11:51)
So we expect that there's going to be maybe a few closed models that show the way and have the best accuracy and the best performance. And then we expect that there will be different open source communities with their own versions of those models, with similar performance and size.
Alex Watson (12:07):
Gotcha, gotcha. And when it comes to the application of these models, whether open or closed, what would you say are some of the biggest roadblocks that you see, across customers or conversations that you're having, to enterprises successfully adopting and using large language models?
Aditya Bindal (12:25):
I think the biggest blocker is often that the zero-shot capability is good enough to prove out some concept, to do a quick MVP and get funding or buy-in internally, but it's not good enough to actually build into an application and deploy to production. So that's one. You typically have to do some amount of instruction tuning to make it adaptable to your domain, to your data. The second is that enterprise customers have a lot of requirements on their production endpoints: compliance, security, how things work within their VPC, what gets logged, what doesn't.
(13:04)
And these are orthogonal to machine learning. These are just true for anything that gets deployed into production. So very often the POC might be successful, but then they have to actually come up with a way to deploy the model within their own environments. And those two things can actually be a lot easier with open source, because the model artifact and the weights of the model are available to the enterprise users, and they can then go and modify that. They can deploy it in their own VPCs and on their own endpoints.
(13:38)
So we've actually seen some cases where the closed proprietary model providers have to come up with a lot of one-off solutions, or come up with encryption, so that they can give enterprise customers that reassurance. Now if you're doing quick experimentation, like you just want to go on ChatGPT and try something out, this stuff doesn't matter. But if you want to scale this up and say, "I want this enterprise customer and its 60 million users to be able to use this thing," the requirements change dramatically. And a lot of those requirements have nothing to do with LLMs and machine learning.
Alex Watson (14:16):
Yeah, that's one thing that I would say we see pretty often at Gretel within our user base and community as well: the data that you're adding is often both intellectual property that's very important to your business, and data with really strong privacy concerns. And particularly in regulated industries, where some of the most exciting applications of LLMs that we see are happening, in the FinTech space, the insurance space, the life sciences research space, data sharing and things like that are blocked.
(14:45)
And there's a risk of what happens if a model ends up memorizing particular customer information and replaying it. What if that data can be extracted from the model, or it gets memorized in a way that it shouldn't? So that's one area that we've definitely put a lot of work into, and that we're also seeing come up as a really important area in addition to the security and compliance risk. Putting data to work requires confidence that you're not invading the privacy of that really sensitive customer data you have.
Aditya Bindal (15:16):
Definitely. And I'm curious, what kind of controls or limitations have you seen customers put in place because of privacy, and what do they do to mitigate that?
Alex Watson (15:29):
So across our customers, and it definitely depends on the use case, you see this goal of finding a balance between the accuracy of your data and the privacy. You could create a perfectly private dataset by not training on anything and have very little utility, or you could train on the raw data and go for maximum utility. What we find is that it's really about looking for a balance, and what we're seeing is the application of really awesome privacy-enhancing technologies in the language model space, in the generative AI space, which has a lot of promise.
(16:09)
Those approaches can happen in a couple of places. One of the areas we've done a lot of research and experimentation with happens during the model training phase, so it can happen as you're training the model, and I'll talk about how that works in a second. Or you can do it on the outputs of the model, making sure that when a model is queried, or it's generating data, it's not generating things that are too sensitive. So really a post-processing step. One of the most powerful privacy-enhancing technologies that we've seen is differential privacy. And one of the cool things that can be done is, as you're training a large language model-
Aditya Bindal (16:43):
So sorry, do you want to explain what that is?
Alex Watson (16:45):
Yeah. Differential privacy is a concept rooted in mathematics, and it's used by the U.S. Census Bureau. It's used by really large examples like Apple, and Google uses it with Android phones to make sure it's learning from user behavior but not memorizing things about that user. And the concept is that you have a dataset, and these typically have to be very large datasets, composed of hundreds of thousands to millions of people; in the case of the U.S. Census, 350 to 400 million people in the U.S.
(17:21)
The idea is, how do we know that no individual's privacy is being compromised by a metric or a model we might share access to? What differential privacy does is insert noise into a dataset, enough noise that, and this is what the metric shows, you could remove any single user from that dataset, query it with any question, and get, with a negligible difference, the same answer. And the idea being, if my data is not inside the census dataset and someone queries it, and I have no impact on the result, then there's no way that my privacy is compromised.
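As a rough illustration of that idea (the textbook Laplace mechanism, not Gretel's implementation), here is what adding calibrated noise to a simple count query looks like. Removing any one person changes the true count by at most 1, and the noise is scaled so the answer distribution barely changes either way.

```python
import numpy as np

def dp_count(records, predicate, epsilon=1.0):
    """Differentially private count via the Laplace mechanism.

    The sensitivity of a count query is 1: adding or removing one person
    changes the true count by at most 1, so noise is scaled to 1/epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# With or without any single individual in `records`, the distribution of
# answers is nearly identical, which is the formal privacy guarantee.
```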
Aditya Bindal (17:59):
No, that's an awesome explanation. And we've seen cases where customers had a trained model, it was hitting the desired accuracy, and then it took six months to review and get to that level of confidence, so that it passes that threshold and people can say, "Yes, this model is not leaking private data."
Alex Watson (18:18):
Yeah. And differential privacy is not without its own set of risks, and there are other technologies that'll have a lower impact on accuracy. But when it comes to absolute confidence that a patient's medical data is not going to be memorized, it's hard to beat. One of the other neat things here is leveraging some really cool work that came out of the Google TensorFlow Privacy team, implementing differential privacy during model training.
(18:42)
So essentially, as data comes into the model, it goes into the optimizer, and we clip the gradients on the optimizer. That has the effect of making sure the model can't memorize rare data, which would apply to a person, and it gives you a lot of confidence there. There are other technologies that we see too. Not everyone wants that perfect differentially private guarantee, or it doesn't always make sense for your dataset. So there are other technologies we see in between: federated learning, interleaving public data with private data while you're training. All these different things can be put together.
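A minimal sketch of that training-time mechanism (a DP-SGD style step written in plain NumPy rather than the actual TensorFlow Privacy or Opacus implementations): clip each example's gradient so no single record can dominate an update, then add noise before applying it.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.01, clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD style update: clip per-example gradients, then add Gaussian noise.

    Clipping bounds how much any single (possibly rare) record can move the
    model, which is what prevents memorization of individual examples.
    """
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    grad_sum = np.sum(clipped, axis=0)
    noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=grad_sum.shape)
    noisy_mean = (grad_sum + noise) / len(per_example_grads)
    return params - lr * noisy_mean
```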
(19:17)
Searching for data that's too similar, or too dissimilar, to the training dataset that came in: these are all kind of neat technologies that can be applied to solve privacy for these models. So maybe jumping on, let's go ahead and talk about the AI application stack. At AWS you have so many developers building on top of LLMs, and there are lots of distinct areas; the ones that come to mind are the model, the infrastructure, and even the user interface that's used to query it. Where do you see opportunities coming up for innovation? Where do you see developers really getting maximum bang for their buck when investing some time in learning how to use these different technologies?
Aditya Bindal (20:00):
Yeah, I'll start with an example that I personally find interesting, and have struggled with. So if you use Stable Diffusion, let's say Stability AI's application, DreamStudio, and you put in a prompt and get an image back, you find that there are these modifiers, "trending on ArtStation," "Unreal Engine," and things like that, where you keep adding these things to your prompt. And it gets to a point where any prompt you find on, let's say, Reddit or Twitter just has 14 different modifiers, and no one, I think, has a good sense of what each one of these modifiers is doing to the image. It's almost become a ritual that you have the actual prompt of the thing that you want, and then you add these 14, 15 modifiers to it at the end, and hope-
Alex Watson (20:49):
Don't forget negative modifiers too.
Aditya Bindal (20:51):
And negative modifiers as well. And I think for the average user, that's a tough user experience. Prompt engineering has many different challenges, and there are people who are now specializing in becoming really good at defining prompts for these models, but for the average user, that's going to be challenging.
(21:15)
So I think one of the overlooked areas where there's not been enough investment, and where we see huge opportunities for customers building these models or building applications on top of these models, is actually in the UI: the user interface, how your end user interacts with the model, how they understand the way to make these models more robust, and what kinds of modifications to prompts can make the image better.
(21:43)
And there are some interesting examples where some applications now offer enhanced prompts, so you type in the prompt you want, and then it auto suggests enhancements to the prompt that might make it better. But I think it's a very complex problem, because when you give a prompt and you get something back, a text output or an image, it's in this semantic space that you can't visually understand or see in an intuitive way.
(22:10)
So you might make one very small change to the prompt, and you could get something completely different, which in your head should not be that different because the prompt itself was so similar, but that semantic space is really far apart. So I think there's a lot of opportunities to innovate in the UI, to make it easier to understand the variability, to make it more robust, to help people incrementally give better prompts, or to remove the notion of prompts entirely and say, "We get the intent, and we're just going to give you the thing that you want."
(22:44)
We've become really good at retrieving and generating based on how we perceive your intent. And one simple example of that is prompt ablation. There's an open source project on GitHub, I forget the name, but it does kind of an automated prompt ablation. And I would fully expect that in maybe a couple of months, you might have some application that says, "Well, within 300 milliseconds, I'll give you the immediate result to the prompt you gave me, and then in 600 milliseconds I'm going to give you an ablated prompt result, or six different ablations that remove one word or two words, or a few phrases, and give you those results."
(23:32)
And then by two seconds after you submit the prompt, you'll be able to see all of this in a single UI, and you can then understand how the variability works. So things like that I think are really missing, and that's because we're so early; these things are happening so quickly that we haven't gotten to a lot of the UI challenges yet. But as bigger enterprise customers start using these technologies, I think the UI will become key.
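I don't know which GitHub project is being referred to, but the basic idea is simple enough to sketch: drop one modifier at a time and regenerate, so the user can see what each modifier actually contributes. The `generate()` helper is a placeholder for any text-to-image or text-generation call.

```python
def generate(prompt: str):
    """Placeholder for an image or text generation call."""
    raise NotImplementedError

def ablate_prompt(prompt: str, separator: str = ","):
    """Return (removed_part, result) pairs, dropping one modifier at a time."""
    parts = [p.strip() for p in prompt.split(separator) if p.strip()]
    results = []
    for i in range(len(parts)):
        ablated = separator.join(parts[:i] + parts[i + 1:])
        results.append((parts[i], generate(ablated)))
    return results

# Example: "a red fox in snow, trending on artstation, unreal engine, 8k"
# produces four variants, each missing one modifier, so the user can see
# which modifiers actually change the output.
```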
Alex Watson (23:58):
Yeah. Building on that a little bit, a very similar use case is in the RLHF space, so reinforcement learning from human feedback. Something so important in making that work is getting a strong signal from the users of your application about what their preferences are, right?
Aditya Bindal (24:15):
Yeah.
Alex Watson (24:15):
So if you are using reinforcement learning to fine-tune your application and the responses coming back, as we've seen happen in ChatGPT, creating an interface where you can get those preferences in real time from customers, about how well it's working for them, is so important to the success of that model. And it can really be empowered by having a good UI that makes it simple to provide that feedback.
Aditya Bindal (24:38):
And that actually ties back to the earlier conversation about data. Because if you have a good UI that allows you to capture user feedback, even a simple thumbs up, thumbs down, "did this work for you?", that's future training data for you. That's data you can use to tune the model for higher accuracy. So I think for someone who figures out the user interface, to make the usability of these models a lot better, and to figure out the hooks that allow you to feed that back into the training, that ecosystem or flywheel can be really powerful.
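A minimal sketch of that hook (the file path and field names are illustrative): log each prompt, response, and thumbs up/down so the records can later be assembled into preference data for fine-tuning or a reward model.

```python
import json
import time

FEEDBACK_LOG = "feedback.jsonl"  # illustrative path

def record_feedback(prompt: str, response: str, thumbs_up: bool, user_id: str):
    """Append one feedback event; these become future tuning data."""
    event = {
        "ts": time.time(),
        "user_id": user_id,
        "prompt": prompt,
        "response": response,
        "label": "good" if thumbs_up else "bad",
    }
    with open(FEEDBACK_LOG, "a") as f:
        f.write(json.dumps(event) + "\n")

def load_preference_pairs(path: str = FEEDBACK_LOG):
    """Group good/bad responses to the same prompt into preference pairs."""
    by_prompt = {}
    with open(path) as f:
        for line in f:
            e = json.loads(line)
            by_prompt.setdefault(e["prompt"], {"good": [], "bad": []})[e["label"]].append(e["response"])
    return [
        (p, good, bad)
        for p, d in by_prompt.items()
        for good in d["good"]
        for bad in d["bad"]
    ]
```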
Alex Watson (25:14):
So that brings us to our next topic, which is a question we hear all the time: how much data do you need to train LLMs, and where is the cutting-edge research? Where is the balance between model parameter size and training dataset size today, and where do we see that changing? I'd love to hear your thoughts on it.
Aditya Bindal (25:33):
So there's an interesting history here. In 2020, OpenAI published a paper on scaling laws for LLMs, and they used the GPT-3 training results to illustrate that you should increase model size more than data size. And if I remember the example correctly, if you have 10 times more compute, then you should increase the model size by five x and the data size by two x. And they demonstrated these power laws that seemed to hold, and it resulted in people training bigger and bigger models. So we got some 600 billion parameter models, and people were talking about trillion parameter models.
(26:22)
And then Google had a paper about a year ago, referred to as the Chinchilla scaling laws or the Chinchilla paper, and they said, actually, that might not be true: you might want to scale the data and the model size differently, and data scaling is more important than we thought. So what that's resulted in is a lot of models that are still large, so they might be tens of billions of parameters, but they can actually outperform the accuracy of bigger models, because they were trained on bigger datasets.
(26:59)
So if you look at GPT-3, I think it was trained on 200 to 300 billion tokens. And now we hear about customers training 100 billion parameter models, but with two to three times that amount of training data. So I think there's been a little bit of a shift, where the focus is now more on training large models, but not these enormous trillion parameter or 500 billion parameter models, and then just training them on more data.
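As a rough back-of-the-envelope on that shift, using the commonly cited ~20 tokens per parameter heuristic from the Chinchilla results and the standard approximation that training compute is about 6 times parameters times tokens (treat the numbers as illustrative, not exact):

```python
def chinchilla_estimates(n_params: float, tokens_per_param: float = 20.0):
    """Rough compute-optimal token count and training FLOPs for a model size.

    Uses the ~20 tokens/parameter rule of thumb from the Chinchilla results
    and the common approximation compute ~ 6 * params * tokens.
    """
    tokens = tokens_per_param * n_params
    flops = 6 * n_params * tokens
    return tokens, flops

for n in (50e9, 70e9, 175e9):
    tokens, flops = chinchilla_estimates(n)
    print(f"{n/1e9:.0f}B params -> ~{tokens/1e12:.1f}T tokens, ~{flops:.2e} FLOPs")

# e.g. a 70B model "wants" roughly 1.4T tokens under this heuristic, far more
# than the ~300B tokens the much larger GPT-3 (175B params) was trained on.
```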
Alex Watson (27:30):
And what are some of the effects that businesses feel when they don't have to fine-tune a 200 billion parameter model, and can instead fine-tune just a 50 or 70 billion parameter model?
Aditya Bindal (27:41):
Well, I think one of the immediate effects is cost, because you can deploy smaller models more easily. So if you take a 50 billion parameter model and quantize it to reduce the memory footprint, you can deploy it on cheaper, smaller accelerators, and you can then have a fleet of these machines to do inference.
(28:04)
If you have to actually deploy a 200 or 300 billion parameter model, then you need to split that graph over multiple GPUs, multiple machines of GPUs, and that gets very expensive. The latency can be quite high. So smaller models have this huge direct benefit of making inference more cost-effective, and making the latencies much more reasonable for real-time applications.
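Some rough arithmetic behind that point (weights only, ignoring activation memory and the KV cache; the numbers are illustrative):

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

for n_params, label in [(50e9, "50B"), (200e9, "200B")]:
    for bits, fmt in [(16, "fp16"), (8, "int8"), (4, "int4")]:
        print(f"{label} @ {fmt}: ~{weight_memory_gb(n_params, bits):.0f} GB")

# A 50B model quantized to int8 (~50 GB) or int4 (~25 GB) fits on far fewer,
# cheaper accelerators than a 200B model at fp16 (~400 GB), which has to be
# sharded across multiple GPUs or machines.
```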
Alex Watson (28:31):
Got it. And if, for example, I'm an enterprise customer in, let's say, the precision agriculture space, just to make up a space, should I be thinking about fine-tuning one of these large language models, or should I be thinking about training my own LLM from scratch? How do you think about that difference?
Aditya Bindal (28:49):
Yeah, the journey we've seen most enterprise customers take is that they always begin with a pre-trained model, and they try to get a baseline and see if it's promising. And once they get that baseline from an existing pre-trained model, they might think about how they could improve upon it. Now, depending on how unique their domain is, they might be forced to pre-train. So for example, we have a customer that was doing things in the legal space, and they had a lot of terminology from private, proprietary contracts that could just never show up in any LLM training dataset. And so the vocabulary was fundamentally different.
(29:37)
So when they tried to baseline, and then even fine-tune, they found that the pre-trained models just didn't contain the vocabulary that they needed. So they had to pre-train the model on their own corpus, because that gave them a model that contained all of this legal vocabulary. So if you're in that situation, where you think the training data might not actually represent the type of situations, prompts, and images that you might see in production, then you might need to pre-train. But in most cases instruction tuning would be much easier and more cost-effective; it wouldn't require as much compute, it wouldn't require as much data, and it would get you a really high level of accuracy.
Alex Watson (30:22):
That's great. So typically, you get a dataset, instruction fine-tune on it, deploy, and see how your results look. And then if your data's too different, you can take a step back and say, "I'm going to think about training an LLM from scratch for this particular use case."
Aditya Bindal (30:37):
And very often you might even know, before you begin trying any of these pre-trained models, how unique your dataset might be. So a lot of these models are trained on sources like Wikipedia, or they might use BookCorpus, or they might use a dataset called The Pile, and you can do a comparison and see what kinds of artifacts and documents are included in there, and how that differs from what you might include in your own datasets.
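One cheap way to quantify that difference before committing to anything (a sketch using the Hugging Face `transformers` tokenizer; the example sentences and any threshold you'd apply are illustrative): compare how many sub-word tokens a pre-trained tokenizer needs per word on your corpus versus on general text. Highly specialized vocabulary gets shredded into many pieces, which is a hint that the pre-training data looked nothing like yours.

```python
from transformers import AutoTokenizer

def tokens_per_word(texts, model_name="gpt2"):
    """Average sub-word tokens per whitespace word; higher values suggest the
    tokenizer (and hence the pre-training corpus) fits the text poorly."""
    tok = AutoTokenizer.from_pretrained(model_name)
    n_tokens = sum(len(tok.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / max(n_words, 1)

general = ["The court granted the motion to dismiss the complaint."]
domain = ["Lessee shall indemnify Lessor against claims arising hereunder."]  # illustrative

print("general text:", tokens_per_word(general))
print("domain text: ", tokens_per_word(domain))
# A large gap suggests fine-tuning may struggle, and pre-training (or at
# least tokenizer and vocabulary work) could be worth considering.
```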
Alex Watson (31:09):
There are some kind of interesting insights on that one too, when you think about how diverse and different data can be. One example we had, which is actually one of the talks at the conference, was working with Illumina on genomic data. And the interesting thing we found is that while genomic data looks radically different from anything a typical large language model would have been trained on before, you have over a hundred thousand columns for each individual, each containing different correlated features and things like that.
(31:40)
Having a model that has an understanding, that for example had seen billions of different tables before, could add meaningful new data to support the synthetic datasets being created, even if it had never seen an actual genomic report before. So there's a kind of interesting insight there, but it's also a really good callout that radically different kinds of data, whether it's a different language or things like that, could require you to start from scratch and train from random weights. And it's not nearly as expensive or scary, perhaps, as it was a couple of years ago to do that.
Aditya Bindal (32:13):
Yeah. And I think the recent trends have shown many cost-effective ways in which you can modify these pre-trained models. So one of the interesting approaches is using retrieval-based models. The idea here is you have the LLM generating text, but then you have a retrieval model that can augment the generation model and can take a corpus of documents. Maybe, to use that same legal example, you have a bunch of court filings, and you want to be able to generate text to write new legal briefs and documents.
(32:51)
So you can use the LLM, augmented by this retrieval model, to be able to search and retrieve relevant documents that you provide. You can think about the benefit of this as akin to fine-tuning: you're making the pre-trained model more accurate for your use case, but you don't have to actually go and change the model's parameters. You don't need to get compute and retrain the model, you just need to use these retrieval-augmented architectures. So there are many interesting approaches now that I think lower the bar, in terms of cost and the level of expertise needed, to allow you to adapt these LLMs for your use case.
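A minimal sketch of that retrieval-augmented pattern. The `embed()` and `complete()` helpers are placeholders for whatever embedding and generation models you use; nothing here is a specific product API, and in practice the document vectors would live in a vector store rather than being recomputed per query.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: return a vector embedding for `text`."""
    raise NotImplementedError

def complete(prompt: str) -> str:
    """Placeholder: call an LLM to generate text."""
    raise NotImplementedError

def answer_with_retrieval(question: str, documents: list[str], k: int = 3) -> str:
    # Embed the corpus and the question, then take the top-k most similar docs.
    doc_vecs = np.stack([embed(d) for d in documents])
    q_vec = embed(question)
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    top_docs = [documents[i] for i in np.argsort(sims)[::-1][:k]]
    context = "\n\n".join(top_docs)
    # The base model's weights are never changed; domain knowledge arrives
    # through the retrieved context in the prompt.
    prompt = (
        "Answer the question using only the context below, and say which "
        "passage you relied on.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return complete(prompt)
```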
Alex Watson (33:30):
One of the really neat things to think about with retrieval models as well is dealing with data latency. So if you have a foundation model that's been trained on data up until 2022, but you want to ask it questions about your business, or about the news, or things like that, which it might not have been fine-tuned on yet, retrieval models open up the door for you to incorporate very recent knowledge into these models as well.
Aditya Bindal (33:57):
Yeah, and there's a great point there about reliability and robustness as well, because then they can cite some of the documents that are used in the retrieval process. And I think You.com has a really interesting YouChat application, where it cites its sources, and I believe it's using a combination of LLMs and retrieval.
Alex Watson (34:21):
Yeah, so powerful. And that's one of the other areas that we hear a lot about from customers. It's like, how do we trust the output of the LLM? How do we know that it's telling me something that's truthful, or rooted in fact, and not just telling me what I think I want to hear? So that ability to cite, I think WebGPT helped pioneer some of this, and You.com is doing some neat stuff there as well.
(34:46)
We hear this as well: if you are asking your organizational LLM a question about your business, you might want to know where it got that data from. So being able to point to the data sources it came from is super powerful, a really interesting idea.
Aditya Bindal (35:00):
So I feel like that tees up really well something I'd love to hear your thoughts on: how do you think about robustness more generally with these types of foundation models? How can you make the models more robust and reduce the potential for harm? What are your thoughts on that?
Alex Watson (35:21):
Yeah. We think about robustness with a model in really two dimensions to start with, and then it expands into more dimensions from there. The first is accuracy. So given that it was trained on a dataset that may have different biases inside of it, the most important thing we're doing is not introducing new biases into it. So you want to start by really understanding the data it was trained on, and how well the model did at learning that data. And with that, I think AI alignment, truthfulness, and fairness all fit into that category of accurate data.
(35:57)
It's aligning the outputs of that model with the culture and the use case of the company. So I think that's a really important part, and we'll talk about how that works in just a second. I think the other part there is privacy. Historically, privacy has often been thought of as a blocker for data: a reason that you can't access data, or you need to mask it or drop it. But I don't think that's the case anymore. I think privacy-enhancing technologies are so good that you can view privacy as an enabler for data, a way to access data, share it more freely, and access it more directly.
(36:30)
And as we were talking about earlier, it's about finding that right mix between accuracy for the data you're trained on and the goal you're trying to achieve. We can dive into how this applies to bias right now. Right now, if you look at the model card on any model that's been published and go down to the section on fairness and bias, it's going to say that this model was trained on biased data, because all data is biased; you just have unwanted biases and wanted biases in there.
(37:00)
And then the question, and really the state of the art today, is that you have your researchers or your team that are deploying this model for a particular reason analyze the outputs of the model and say, "Is it introducing or perpetuating any unwanted biases?" So it's a very manual step right now. I think one of the things that I'm personally really excited about is the idea that we could actually use LLMs to police the outputs of LLMs. So it may not always have to be this really manual process. What if through RLHF, for example, or through a hierarchy of different LLMs, you could ensure that the output of your model met the fairness requirements and the distributions that you're looking for?
(37:45)
That raises some eyebrows when you start talking about LLMs, and you're like, "Wait, I thought I heard in the news that this particular LLM could be used to say some really unwanted or biased things." My experience working with this, and I'm curious about yours as well, is that when you prompt LLMs, what they're really doing is trying to answer the question almost in the tone that you asked it. So it takes on a personality of something it's seen before, and says what it thinks is the most logical answer for that.
Aditya Bindal (38:15):
So it's completing the sequence.
Alex Watson (38:18):
Absolutely. Yeah.
Aditya Bindal (38:19):
Yeah.
Alex Watson (38:20):
And our initial research, and an area we're spending a ton of time on right now, is saying: what if you ask it to look at the output of an LLM and say, "Does this seem fair? Are there certain columns in here that might introduce unwanted biases?" The initial results are very promising and warrant a ton of research and diving in, but it does give us some hope that there could be some automation for detecting this really complicated and nuanced fairness and alignment problem for AI models.
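A rough sketch of that idea, using one model to screen another's output. The `complete()` helper is a placeholder, and the rubric is illustrative rather than a validated fairness test.

```python
import json

def complete(prompt: str) -> str:
    """Placeholder for an LLM completion call."""
    raise NotImplementedError

REVIEW_PROMPT = """You are reviewing generated data for unwanted bias.
Given the sample below, answer in JSON with keys:
  "concerns": list of columns or phrases that could introduce unwanted bias,
  "ok_to_release": true or false.

Sample:
{sample}
"""

def review_generated_sample(sample: str) -> dict:
    raw = complete(REVIEW_PROMPT.format(sample=sample))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Treat unparseable reviews as "needs a human look" rather than passing.
        return {"concerns": ["unparseable review"], "ok_to_release": False}
```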
Aditya Bindal (38:49):
I'm curious, do you think there's potential for having some kind of standardized evaluation criteria that would allow customers to say, "If it passes these criteria, then I know that it's not terrible"? We might need to do more, but as a baseline, I'm reassured. And I'm thinking of this paper from Stanford recently, the Holistic Evaluation of Language Models, the HELM paper, and they had some interesting benchmarks.
(39:21)
I think they have a nice website now that shows you how all the different models rank. But in your experience, especially with some of the things you've talked about around using data or using LLMs, what could those evaluation criteria be? Or do you think such standardized evaluation criteria could even exist?
Alex Watson (39:41):
Yeah. Since the first time we created a model that could recreate tabular data, like three years ago, where you created a new dataset and looked at it, the very first question we got from our users was, "That looks great. How accurate is it?" And that's something I think we've evolved quite a bit. If you look at the evolution that's happened in the generative model space, I think it started with understanding accuracy: how does the distribution of outputs of this model, whether I'm using conditional or unconditional data generation, match up against the original data it was trained on?
(40:15)
And the second dimension that we've seen on top of that, and I think those metrics are getting pretty robust, which is really exciting to see, is where you see things like: how close is this to the original distribution that I had, whether using PCA or some sort of dimensionality reduction technique to compare them? It's a pretty well-understood problem. Privacy is the next step: how private is my data? And we hear about that quite a bit. And there are the mathematical guarantees you can get from differential privacy.
(40:45)
They don't always apply. And even when they do, what we've found is that our users are looking to trust but verify. So essentially, really interrogating the model and attempting to extract secrets from it, so they have a sense of confidence in saying, "Yes, I'm happy about this being shared." And I think what you were just getting at is the newest, and also very important, area of how do I understand the alignment of the model for the use case, or the tasks, that I'm applying it to. All of these are really important, and we really see them happening at two different levels.
(41:23)
One, you want to make sure that your model, as it's being fine-tuned, is meeting those constraints or expectations you have. The second thing is, after your model's deployed and it's generating data, is it hallucinating new things? Is it generating examples like what I want, ones that I can rely on? And one of the things I think is fascinating, we've all experimented with DALL-E, and Stable Diffusion, and things like that: you generate images, you get a couple that are great, and then you get something that's just totally weird.
(41:50)
And as you think about how enterprises are going to adopt generative AI, creative applications are definitely up there. But I think at the same time, what we see our user base doing is using it to train machine learning pipelines, for example. And for a machine learning pipeline, you're not generating one image at a time, you're generating thousands of images to help your model respond to data it's never seen before. And for that to work, you have to really trust that the outputs of that model, whether it's summarizing text or generating new images, match those expectations from the users.
(42:22)
So I think another really important part is scoring the outputs of your generative model, whether it's images, or text summarization, for example of a post or a sales call transcript, or whatever it is; you want to make sure that it's accurate. And that's a whole different class of things. We call them validators, but they essentially look at the output of each individual model and say, "Yes, this matches my expectations," or not.
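A minimal sketch of what such a validator can look like for tabular outputs. The schema and rules here are illustrative; real validators would be derived from the training data and the use case.

```python
def validate_record(record: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes.

    `schema` maps column name -> {"type": ..., "min": ..., "max": ..., "allowed": ...}.
    """
    problems = []
    for col, rules in schema.items():
        if col not in record:
            problems.append(f"missing column: {col}")
            continue
        value = record[col]
        if not isinstance(value, rules["type"]):
            problems.append(f"{col}: expected {rules['type'].__name__}")
            continue
        if "min" in rules and value < rules["min"]:
            problems.append(f"{col}: below minimum {rules['min']}")
        if "max" in rules and value > rules["max"]:
            problems.append(f"{col}: above maximum {rules['max']}")
        if "allowed" in rules and value not in rules["allowed"]:
            problems.append(f"{col}: value not in allowed set")
    return problems

# Illustrative schema for a synthetic patient-visit record.
schema = {
    "age": {"type": int, "min": 0, "max": 120},
    "visit_type": {"type": str, "allowed": {"routine", "urgent", "follow_up"}},
}
print(validate_record({"age": 203, "visit_type": "urgent"}, schema))
# -> ['age: above maximum 120']
```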
Aditya Bindal (42:54):
I think it also goes back to the conversation we had earlier about UI, and I think the user interface can play a big role in reducing the chance of harm and trying to improve accuracy. So a good example of this is, there are things you can do at the prompt level. If someone enters a prompt that might be malicious or looking to generate questionable content, there are things in the UI you can do that actually catch that early on, and that's not even a model issue.
(43:27)
And then once you get the output, I think there are things you can do within the UI to just continually get that feedback. So if you have a thumbs up, thumbs down mechanism, you can use that to feed something back. And to your point, I think we'll become good at building models that can classify the output of other foundation models. So the way that we've seen customers think about this, which really resonates with me, is that we should think about building engineering systems end-to-end that treat this as an ongoing problem, a metric that you need to minimize.
(44:02)
It's not a box you need to check, it's not a gate you need to go through. You need to have some kind of real-time deployment. And to that end, we've actually been building tooling within SageMaker, our fully managed platform for machine learning, so that we can have some of these explainability and bias detection tools built into your training and your hosting endpoints.
Alex Watson (44:27):
That's, I think, such a cool premise for generative AI. And thinking about the integration with different MLOps frameworks like SageMaker, Vertex, MLflow, Kubeflow, that whole ecosystem. One of the challenges that we hear from customers is that even outside of our users, outside of the testing that they do, when the model's deployed, invariably the types of inputs the model gets are going to change.
(44:51)
So being able to detect that and know how the model's going to respond, and, one of the things we're really excited about, using generative AI to essentially build new training sets, so you can keep your models deployed longer.
(45:03)
You could imagine, for example, you have a model that is deployed as a classifier, or as a generative model that is doing some sort of sentiment analysis task. You see the data changing, and essentially your goal is to keep that model deployed and keep it performing well.
(45:17)
And one of the cool things about generative AI is, if you see a shift in the type of data coming in, you can ask it to generate more data in that domain to help the model respond to the types of things it's seeing. Incorporating model monitoring, automated retraining, and things like that could shortcut the massive downtime that so many people have when they have to go back to that whole data collection process.
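A rough sketch of that monitoring loop, using a simple two-sample KS test from SciPy as the drift signal; the threshold and the `generate_synthetic()` call are illustrative placeholders, not a specific product API.

```python
from scipy.stats import ks_2samp

DRIFT_P_VALUE = 0.01  # illustrative threshold

def generate_synthetic(examples, n: int):
    """Placeholder: ask a generative model for n records like `examples`."""
    raise NotImplementedError

def check_and_augment(training_feature, live_feature, live_examples):
    """Compare a live feature distribution against training data; if it has
    drifted, generate synthetic records in the new region for retraining."""
    stat, p_value = ks_2samp(training_feature, live_feature)
    if p_value < DRIFT_P_VALUE:
        # Distribution shift detected: synthesize data that looks like the
        # new traffic so the model can be refreshed without a long
        # collection-and-labeling cycle.
        return generate_synthetic(live_examples, n=10_000)
    return None
```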
Aditya Bindal (45:37):
And I'm curious, to build on that, how do you gain confidence in that process? If, let's say, you had a data scientist at a company go to an executive and say, "I think we can invest in and build these synthetic datasets to augment or even potentially replace retraining the model when it drifts," how do you know that's going to work beforehand? How do you get confidence in that mechanism?
Alex Watson (46:04):
Usually, when this happens, there's real money associated with it. So if you are a cybersecurity company and you're detecting attacks, or you are a FinTech company and you're looking for fraud, for example, there is real money associated with getting these predictions right or wrong. So it is definitely an iterative process. We've found that running a suite of experiments, for example using a model to generate a lot of different examples, can help you understand both what that shift in data is and why the model's helping, which is really the key to having confidence in it.
(46:42)
But then also, over time, just running it proves itself as something that should be part of the new ML workflow or process. So it's an evolving area right now, but I think the relatively inexpensive cost of deploying generative models to create new datasets in parallel, and seeing how they perform against your existing models, is a way to prove that out over time. We've had really good initial success so far, and over the next year we're looking at tighter integrations with different MLOps frameworks, and really getting this into the process that people build with.
Aditya Bindal (47:15):
Are there any rules of thumb? Maybe this is kind of analogous to the Chinchilla discussion we had. Like, when am I approaching the limit, where if I 10x from here, I'm not going to get much more benefit?
Alex Watson (47:27):
Yeah. We see a couple of different ways that people augment a dataset to make it better. So if you have real-world data and you're trying to make it better, it really starts with the same process each time: train a generative model, or fine-tune a generative model, on it. And often, when there's a class imbalance, that's an area that's particularly suited for synthetic data augmentation.
(47:50)
So if you have fraud, to go back to an earlier example, you might only have a few examples of something bad, and being able to boost those few examples and create a thousand more, for example, can have real impact on your downstream analysis. So today it's a bit of an art, a mix of science and art, but definitely a little bit more on the art side. But it is an area where we see a lot of initial success, and we're learning and building as we go.
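A sketch of that augmentation step. The `conditional_generate()` call is a placeholder for whatever conditional generative model you've trained, and the target fraction is illustrative.

```python
from collections import Counter

def conditional_generate(label, n: int):
    """Placeholder: sample n synthetic records conditioned on `label`."""
    raise NotImplementedError

def boost_minority_classes(records, label_key="is_fraud", min_fraction=0.25):
    """Top up rare classes with synthetic records until each class has at
    least `min_fraction` as many examples as the largest class."""
    counts = Counter(r[label_key] for r in records)
    majority = max(counts.values())
    synthetic = []
    for label, count in counts.items():
        target = int(min_fraction * majority)
        if count < target:
            synthetic.extend(conditional_generate(label, target - count))
    return records + synthetic
```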
Aditya Bindal (48:17):
Awesome.
Alex Watson (48:19):
Well, thank you, Aditya, for joining us today. An awesome discussion. I'm super excited about both where the generative AI space is today and where it's going to go over the next year. We'd love to welcome any questions from the audience, and thank you for the time.
Aditya Bindal (48:34):
Yeah, thanks so much.