Synthesize 2023: Bootstrapping NLP Applications
Video description
Bootstrapping NLP applications with LLMs and synthetic data
Transcription
Alexandre Matton (00:05):
Hi everybody. My name is Alexandre Matton. I'm an ML engineer at Cohere and I'm very excited to be here to talk about how we can bootstrap NLP applications with large language models. During this talk, I will first explain what a large language model is and how we can use one, and then we'll dive into some very common use cases in industry that can be directly powered by large language models. Finally, I will also show some more advanced use cases, which basically entail using large language models to generate synthetic data and improve machine learning pipelines.
(00:50)
First, what is a large language model? To sum up very quickly, a large language model is a neural network. It's a very big neural network, and it has been trained on a very large portion of the internet. Thanks to that, it can understand text and accomplish a lot of different text-related tasks, like, for instance, writing very fluent text, extracting information, or comparing pieces of information. And all of this works directly out of the box.
(01:23)
Deploying a large language model is a pretty hard task because these models are huge, so they need clusters of computers to be deployed. What this means is that developers generally use them through APIs. Actually, for the developer this makes things very easy, because it's only a few lines of code and you can directly use most of the functionality out of the box.
(01:47)
Here I just copy-pasted a small screenshot of what it looks like. For instance, you just need to insert your API key, and then you can, in one line, choose a model to generate from, choose a prompt, which is basically the input of the model that explains to the model what it has to do, fix a couple of parameters like the size of the output, things like that, and you get an output in a couple of seconds. You can do this as many times as you want, so as you can see, it's a very, very minimal setup.
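A minimal sketch of what such a call might look like in Python, assuming the Cohere Python SDK's generate endpoint; the model name, prompt, parameter values, and response fields below are illustrative, not taken from the slide's screenshot:

```python
import cohere

# Paste your API key here; everything else is a single call.
co = cohere.Client("YOUR_API_KEY")

response = co.generate(
    model="command",          # which model to generate from (illustrative name)
    prompt="Write a short welcome email for new users of a note-taking app.",
    max_tokens=200,           # controls the size of the output
    temperature=0.8,          # higher values give more varied outputs
)

# The generated text comes back in a couple of seconds.
print(response.generations[0].text)
```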
(02:27)
Now that we've seen that, let's talk about some of the use cases that are enabled by this technology. The first one, as you saw in the last slide, is text generation. It's kind of the main use case because that's how large language models were trained. Basically, you can ask a large language model to write about anything, with the formality level that you want and the length that you want, and it should produce fluent text corresponding to your request.
(03:01)
Here I just asked the model to write something which looks like a LinkedIn post about starting your career in tech. You can see that the output is a text which is pretty fluent and which pretty much corresponds to what you would see on LinkedIn about that. If you don't like the answer, the great thing is that the output is random, so you can regenerate a new answer until you are satisfied with it. As you can see, with text generation there are already a lot of things you can do. It doesn't have to be a LinkedIn post; it can be really anything, and so it really helps you write text faster.
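Sketched under the same assumptions as before, the LinkedIn example is just another prompt passed to the generate call, resampled a few times to get candidate posts; the prompt wording and the number of candidates are illustrative choices:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # same setup as in the earlier sketch

prompt = "Write a LinkedIn post about starting your career in tech."

# The output is sampled, so every call can give a different post;
# keep regenerating until you find one you like.
candidates = [
    co.generate(model="command", prompt=prompt, max_tokens=250, temperature=0.9)
    .generations[0]
    .text
    for _ in range(3)
]

for i, post in enumerate(candidates, 1):
    print(f"--- Candidate {i} ---\n{post}\n")
```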
(03:43)
A second use case which is very prevalent in industry is text classification. What I did here is build, very quickly, a small spam classifier. The way I did that was to put several examples in my prompt. Here I wrote a sentence, "Hey John. It's Emily, how are you doing?", and I said that the category was not spam. Then I wrote another sentence, this one corresponding to a spam message, and I put spam as the category. For the last sentence, which is the one I want the model to classify, I wrote the sentence but I didn't fill in the category.
(04:25)
Because large language models learn how to complete text, the large language model I'm using knows that it has to fill in the blank, and that's what it's doing. As you can see, the output here is spam. So what this means is that in really just a couple of sentences, I was able to make the large language model understand that what it has to do is classify text into spam or not spam.
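A rough sketch of that few-shot prompt, again assuming the same Cohere generate call; the second spam example and the message being classified are made up here for illustration:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")

# Few-shot prompt: two labeled examples, then the message to classify.
# The model completes the missing category.
prompt = """Message: Hey John. It's Emily, how are you doing?
Category: not spam

Message: Congratulations!!! You won a FREE phone, click here to claim it now!
Category: spam

Message: URGENT: your package is blocked, pay $2 at this link to release it.
Category:"""

response = co.generate(
    model="command",
    prompt=prompt,
    max_tokens=3,      # we only need the label
    temperature=0.0,   # keep the classification output stable
)
print(response.generations[0].text.strip())  # expected: spam
```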
(04:52)
Finally, a last very common use case is extraction. Here, I took a very generic example. What I did was copy-paste a paragraph about Paris from Wikipedia, and I asked the model to extract all the cities that appear in the text. You can verify that the two cities that appear in this text are Paris and London, and the model was able to find them. Obviously you can do this with a lot of different examples. You could ask a model to extract, let's say, the prices that appear in a text, or to extract the sender of an email, and so on. So there are a lot of use cases that are enabled by this technology.
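A small sketch of the extraction prompt under the same assumptions; the paragraph below is a stand-in, not the actual Wikipedia excerpt used on the slide:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")

paragraph = (
    "Paris is the capital and most populous city of France. "
    "Its cultural influence is often compared to that of London."
)  # stand-in text; the slide used a real Wikipedia paragraph

prompt = (
    paragraph
    + "\n\nExtract all the cities that appear in the text above, as a comma-separated list."
)

response = co.generate(model="command", prompt=prompt, max_tokens=20, temperature=0.0)
print(response.generations[0].text.strip())  # e.g. "Paris, London"
```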
(05:41)
So the three use cases I just showed you are pretty straightforward, very simple. The great thing is that you don't need any machine learning knowledge to make them work. You don't need any expensive setup. You don't need to create training data or anything, and it works directly out of the box with very good accuracy.
(06:05)
If you want, you can also use large language models for more advanced use cases, and here are some examples where we use them to actually generate synthetic data, which is going to help us build other machine learning models. Why would we do that, given that large language models are already very good machine learning models in themselves? There are several reasons.
(06:29)
The first one is that large language models, as I explained before, are huge, which means they are costly and also a bit slow. What you can do is use a large language model to generate training data, so that you'll be able to train smaller machine learning models that are much cheaper, much more efficient and much faster.
(06:53)
Another reason to use large language models to generate synthetic data is to diversify your training dataset. In machine learning, we know that a model's robustness is very important, and to improve this robustness, we need to diversify the examples we have in our data. Large language models are a very simple way to do that, as you'll see in the next slide.
(07:20)
Finally, the last use case that I thought of with synthetic data is using large language models to create data for tasks that they cannot solve themselves. For instance, let's say the task at hand is to search for relevant sentences in a database. Large language models can't really do that because they can't read the whole database, so usually you have to use very specialized models. However, to use these models, you need to train them first, and that's where the large language model can be used.
(07:54)
So to make all this clearer, let's dive into two more use cases. The first one is spam detection. Let's go back to our spam detection idea. As we saw a couple of minutes ago, we can directly use the large language model as a spam detection system. But if we want a solution which is simpler and in-house, we can train a small model to do that. To train it we need data, and in particular spam data, which is data that we might not have easily at hand. So what we can do is ask the large language model to generate this data.
(08:35)
So here I asked the large language model to generate five examples of spam messages related to a company selling clothes. Here you can see the model outputs: five different messages that are pretty diverse. The great thing is that I can cherry-pick them, which means I can keep the ones I like and throw away the ones I dislike. Also, I can just ask it to generate more and more until my training data is big enough.
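A sketch of how that generation loop could look, still assuming the same Cohere generate call; the prompt wording, the line-by-line parsing, and the target of 100 messages are illustrative choices, not from the talk:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")

prompt = (
    "Generate five examples of spam messages sent by a company selling clothes. "
    "Write one message per line."
)

synthetic_spam = []
# Keep asking for more batches until the training set is big enough,
# keeping only the messages we like.
while len(synthetic_spam) < 100:
    response = co.generate(model="command", prompt=prompt, max_tokens=300, temperature=1.0)
    batch = [line.strip() for line in response.generations[0].text.splitlines() if line.strip()]
    synthetic_spam.extend(batch)

print(len(synthetic_spam), "synthetic spam messages collected")
```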
(09:06)
So as you can see, this is a very easy way to generate a lot of training data. It's much easier to do that than having to come up with the data ourselves. So that's a way to build an ML application quickly and to make sure that it's going to be robust by improving the diversity of our data.
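To make the small in-house model concrete, here is one possible sketch using scikit-learn; the tiny placeholder datasets and the choice of TF-IDF plus logistic regression are illustrative assumptions, not something specified in the talk:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder data: in practice, synthetic_spam would come from the LLM
# generation loop above, and not_spam from real (or also synthetic) messages.
synthetic_spam = ["50% off all sweaters, click now!", "FREE shipping today only!!!"]
not_spam = ["Hey John, it's Emily, how are you doing?", "Meeting moved to 3pm."]

texts = synthetic_spam + not_spam
labels = ["spam"] * len(synthetic_spam) + ["not spam"] * len(not_spam)

# A small, cheap, fast model trained on the LLM-generated data.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)

print(classifier.predict(["Exclusive deal: buy one jacket, get one free!"]))
```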
(09:32)
The second use case of synthetic generation that I want to talk about is improving search models. Just to give you a little bit of background, search models are models that match a query with a corresponding document or corresponding text in a collection of passages. There are a lot of search models actually deployed in production, and they are very useful.
(09:58)
For instance, take an eCommerce website where customers might input a query, let's say "blue sweater". The eCommerce website then has to find all the items that correspond to this query. So to train models like that, you need all the passages, all the documents. An eCommerce website will already have all the items available, but you also need relevant queries. If you're a new website, you might not have queries from customers yet. So one idea is to use the large language model to generate relevant queries, which will help you train the machine learning search model without having interacted with customers yet.
(10:48)
So here I give a very simple example where I took a passage from Wikipedia and I just asked my large language model to generate a question out of this passage. For an eCommerce website, you could do the same thing by giving a description of a product as input and asking the model to generate a query for this description as output. So those were all the use cases I wanted to talk about.
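A sketch of that query-generation step for the eCommerce case, under the same assumptions as the earlier snippets; the product description and prompt wording are placeholders:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")

product_description = (
    "A soft navy-blue wool sweater with a crew neck, available in sizes S to XL."
)  # placeholder item description

prompt = (
    product_description
    + "\n\nWrite a short search query a customer might type to find this item."
)

response = co.generate(model="command", prompt=prompt, max_tokens=15, temperature=0.7)
synthetic_query = response.generations[0].text.strip()
print(synthetic_query)  # e.g. "blue wool sweater"

# Pairs of (description, synthetic_query) like this one can then be used
# to train a dedicated search model.
```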
(11:19)
Here are some conclusions to sum up what I said earlier. First of all, large language models are a technology that, as you can see, can enable any developer to very quickly and very easily incorporate text understanding tasks, like classification and extraction, into their projects. What's great about them is that you don't need any knowledge in machine learning; you can do it extremely easily in five minutes, with no need for all the machine learning data and pipelines to be set up. Moreover, you can use them for more advanced use cases like generating synthetic data, which will help you to improve other ML pipelines.
(12:03)
Just a small disclaimer about large language models: their output is not always safe, which means that sometimes they can be wrong if you ask them to do something specific, and sometimes they can output sentences that are biased or offensive, so large language models are not suited for all use cases. For instance, you should never show the output of a large language model directly to your customers without any verification or filtering first. However, there are still a lot of use cases where this technology is directly applicable and solves problems very easily.
(12:41)
Thank you so much for your time. I hope my presentation was clear, and I'll stay a little bit for the Q&A.