Synthesize 2023: Synthetic Data for Training Large NLP Models
How to train large translation models with synthetic data
Spence Green (01:09):
Hello everyone. My name is Spence Green. I'm the CEO at Lilt. We build large translation models for enterprise use cases, for businesses and government, making digital products and services available to people on the internet in the language of the customer's choice. So we work with large companies like Intel and ASICS, and then we also work throughout the government, mostly in the DOD and the IC, where we have use cases for collecting foreign language information and then helping analysts understand that information. A little bit about my background before starting this company, which came out of research work at Stanford University: I was a grad student at Stanford, and I met my co-founder John DeNero working on Google Translate about 10 years ago. So it's a super exciting time in NLP right now. I'm sure everybody's seen ChatGPT, which is itself a large language model.
I'll talk a little bit... I'm not going to talk specifically about how one would train a model like that. I'm going to focus more on translation, which is what I know and what we do, but it's the same model architecture. So I'll talk a little bit about the current state of NLP. I'll give a little bit of background on the transformer architecture and then we'll talk about synthetic data for the translation use case. Okay, so NLP is super hot right now. It's the first time I can remember sort of Christmas cocktail conversations talking about NLP in the 15 years that I've been in the field. So that's really exciting for all of us. NLP is an old field and machine translation was the first task in NLP. So going back to the early 1950s, people started building MT systems for Russian to English.
These were rule-based, so you had linguists writing language rules. You'd put those in lookup tables, and then you'd get an input, look up those rules in a table, and generate an output. And those predictably didn't work terribly well. Then from the seventies through the early nineties, there were more hand-built, still knowledge-intensive systems, but all still deterministic. You started to get into using computational grammars. These systems worked slightly better, but still not very effectively. The early 1990s, at IBM, was the first time that we started using machine learning, statistical machine learning, to build translation systems. And this was really driven by the abundance of digital text that appeared on the internet. So now you could start to build probabilistic models for this task that were not dependent on having people write manual rules.
And then that brings us to today, which is the application of neural networks to lots of different language tasks. And really in the last couple of years, the creation of these very large language models. So the architecture that everybody uses in NLP these days is called the transformer. It originated at Google in a paper from June of 2017, so it's almost a six-year-old architecture, and it's used for all NLP tasks. It used to be the case that you would train specific translation systems or named entity recognition systems, and those would be different model architectures. Now, everybody uses the same architecture. For translation, it's got two stacks: an encoder stack and a decoder stack. On the left-hand side is the encoder stack, and that takes an input, here an input in French, and converts it to a representation.
And then on the right-hand side, the decoder stack takes that representation of the input and generates an output. Now, large models like ChatGPT are usually decoder-only. We won't go into why you would use two stacks versus one, but for the translation task, you get state-of-the-art performance using this network architecture and training on translation-specific data, which you see here: a French sentence and its English translation. That's what you train large-scale systems on. So like I mentioned, the encoder stack takes the input, which is words, converts them to a vector format, and then learns a representation of the sentence. And so this is what the system sees: these vectors of numbers. And there are different ways that you can compute these vectors, which are called embeddings, that we won't go into here.
But this is the basic conversion step, going from the input text to what the neural network trains on. And it seems a little bit strange to represent a word as a vector. So why would you do that? Well, the key idea is that it helps you compare words. You can create a multivariate representation of anything. You could create a multivariate representation of people, for example. So let's say that we're going to compare people, and the way that we do that is we take some personality traits and we score people based on those traits. So here's my vector representation. On the left-hand side, maybe I'm not very open, but I'm very agreeable. I'm slightly conscientious. I don't get negative, and I'm slightly extroverted. So we could write that down as my vector representation. These are just numbers that represent these different traits.
And then what we can do is we can take a bunch of people and we can rate them and represent them this way, and then we can start comparing them mathematically. So this lets us compute scores for how similar one person is to another. And the key idea in NLP is that the meaning of words is represented by the context. So the meaning of a word is represented by the context, the linguistic context that it tends to appear in. And what the neural network is doing is learning representations of words based on the context in which those words appear in sentences.
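The person-comparison analogy above can be made concrete in a few lines of Python. The people, traits, and scores here are invented purely for illustration; the point is that once anything is a vector of numbers, similarity becomes a simple computation, most commonly cosine similarity:

```python
import math

# Hypothetical trait vectors: each person scored on
# [openness, agreeableness, conscientiousness, neuroticism, extraversion].
people = {
    "spence": [0.2, 0.9, 0.6, 0.1, 0.6],
    "alex":   [0.3, 0.8, 0.5, 0.2, 0.7],  # similar profile to "spence"
    "kim":    [0.9, 0.1, 0.2, 0.8, 0.1],  # very different profile
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# "spence" vs. "alex" scores high; "spence" vs. "kim" scores much lower.
score_similar = cosine_similarity(people["spence"], people["alex"])
score_different = cosine_similarity(people["spence"], people["kim"])
```

Word embeddings work the same way, except the dimensions aren't hand-labeled traits; they're learned automatically from the contexts words appear in.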
And so this is how these models are robust to phenomena like, take the word bank, for example. Bank can refer to the side of a river, or it can refer to a place where you store money. In the old world, where we represented words as individual, indivisible entities, it was very difficult for the system to learn how one would use a word in two different contexts. In this world, where we're representing words as vectors of numbers, the system can start to learn how a word would appear in different contexts, which is what makes these transformer networks such great models of language.
There is a component of transformer networks called attention. This was really the big breakthrough. Attention came from a 2015 paper and then was extended in the 2017 paper from Google that I mentioned. And what attention does, you can just think about it as creating alignments. So in translation, we look at what are called alignment grids. Here you see a German sentence and its English translation. And what the system starts to learn is correlations between words. You see here, it's sort of a heat map: the system learns high correlations between words, and we call that an alignment. And this is the machine learning approach to translation, learning these mathematical correlations between different words. You learn that from lots of data, and that gives you a trained system that enables you to translate new inputs. Okay. So that's how you train these models and this is how you apply them to translation.
So like I said, the transformer simply computes correlation strengths between words, and in this case you get these alignment grids, and that's how it produces new outputs. That's the learning procedure. Okay, so that's how you train these networks. Now let's talk about synthetic data and how you can use it to augment training. We'll talk about a couple of different classes of augmentations that we use in training translation systems. We'll talk about rule-based augmentations; these are ones that you can represent, for example, with simple regular expressions that you can just apply to text. We'll talk about synthetic parallel data, which is one way you can use very large systems that may not be practical for particular use cases: you generate data from them to train smaller systems. That's a method of compression. We'll talk about terminology augmentation, which is when you're changing specific words and phrases in the data, and how that's useful.
And then we'll talk about grammatical error correction, which is a product we build called AutoReview; it's a step that comes after translation, when you want to start making stylistic changes. And you can learn that from data too. Okay, let's start with rule-based augmentations. These are augmentations of the data that are very simple to write down. One example here, the first one, is capitalization. There are some contexts in which you want all capital letters, for example legal contexts where you have conspicuous text for a warranty or something. And so what you can do is just augment the training data with all-caps versions of the same sentences.
And that will give you a model that's more robust to different contexts where the casing changes. In software settings, you often have variables in text. These can be things like names of people, places, and things, and when they appear in a software UI, the software system will just substitute in a variable. So if you want to train a language system on that, you need to have a representation for these variables. And you can do that by going in and putting some markup here, in this case a dollar sign and some curly braces, around different spans of text, and then training the system on that.
Punctuation is another thing that in some contexts is challenging to get right. So you can go in and manipulate the punctuation in the text, and that'll give you a better ability to model different punctuation conventions than you would ordinarily have from the training data. And the last thing you can do depends on the size of the input: say you have a system that's trained on shorter sentences, but you know that the actual downstream application is full paragraphs. You can do things like concatenating sentences together to build paragraphs and train the system on that. And this works; it's very simple. These are transformations that you can write with a Python script that will augment the data, and it really helps. So here are some results of evaluating a system: first the baseline generic system on English to several different languages, and then the system augmented with all-caps text.
And here the test set has all-caps text in it. So you see, just adding some all-caps text, a very simple one-line transformation of the data, gives you a profound improvement in translation quality. And this is a translation quality metric that we use in machine translation research called BLEU. How it's computed is not terribly interesting or relevant; you should just know that higher is better on this metric, and an improvement of 2 to 3% is worthy of a research paper. So getting a 20% improvement for Italian here is a very significant improvement. And you usually only see this doing some type of domain adaptation, which is what we're doing here by using synthetic data to adapt to an all-caps test set.
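The rule-based augmentations described above really are simple enough to fit in a short Python script. The helper names and the toy corpus below are invented for illustration; a production pipeline would apply these over millions of sentence pairs:

```python
def augment_all_caps(pair):
    """All-caps both sides, e.g. for conspicuous legal/warranty text."""
    src, tgt = pair
    return (src.upper(), tgt.upper())

def augment_variables(pair, names=("Alice", "Paris")):
    """Replace known entity strings with placeholder markup like ${0},
    mimicking variables substituted into a software UI."""
    src, tgt = pair
    for i, name in enumerate(names):
        token = "${%d}" % i
        src = src.replace(name, token)
        tgt = tgt.replace(name, token)
    return (src, tgt)

def augment_punctuation(pair):
    """Strip terminal punctuation to cover a different convention."""
    src, tgt = pair
    return (src.rstrip(".!?"), tgt.rstrip(".!?"))

def augment_concatenate(pairs):
    """Join several sentence pairs into one pseudo-paragraph pair."""
    srcs, tgts = zip(*pairs)
    return (" ".join(srcs), " ".join(tgts))

# Toy usage: the augmented copies are added alongside the originals.
corpus = [("Alice lives in Paris.", "Alice habite à Paris.")]
augmented = (corpus
             + [augment_all_caps(p) for p in corpus]
             + [augment_variables(p) for p in corpus])
```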
So that's deterministic data augmentation. Now let's talk about synthetic parallel data. One thing that you can do when you're in a low-resource setting, take a language like Dari: most language systems these days are trained on web crawls, and this presents a problem when you're training a system for a language that's not really written or doesn't appear much on the internet. What you can do is train systems on lots of different languages all at the same time, which is the concept behind this paper that came out last year from Meta called No Language Left Behind. They trained on 200 different languages at the same time. And you can use a system like that, trained on lots of different languages, to generate synthetic data in a couple of ways. One is in a forward direction: you take an input and generate bunches of output.
And the other way is to go in the other direction: you take text in the target language and you generate data going back the other way. And this has been done for a couple of years now; everybody does this these days. It's sort of surprising to me, because with the previous generation of translation systems, statistical machine translation systems, this just would not have worked at all. But it's very effective, it's a completely sensible way of building systems these days, and it works great. So this shows, this was from the Meta paper, showing again BLEU, I think it's BLEU, where higher is better and so on. In low-resource settings, you get a significant improvement doing synthetic data generation for lower-resource languages. So again, a very simple augmentation: you take a bigger model that's trained on more languages, you create some output with that model, and then you train a smaller model using that output.
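The reverse direction described above, usually called back-translation, can be sketched as follows. The "model" here is just a lookup table so the sketch is self-contained; in practice you'd call a trained multilingual transformer like the NLLB systems mentioned in the talk:

```python
# Stand-in for a trained reverse-direction model (target -> source).
# A real system would run a transformer here; the lookup table just
# keeps the sketch runnable.
REVERSE_MODEL = {
    "good morning": "guten morgen",
    "thank you": "danke",
}

def translate_back(target_sentence):
    """Translate a target-language sentence back into the source language."""
    return REVERSE_MODEL.get(target_sentence.lower(), "<unk>")

def back_translate(monolingual_target):
    """Back-translation: pair each real target-language sentence with a
    machine-generated source sentence. The forward (source -> target)
    model is then trained on these (synthetic source, real target) pairs."""
    synthetic = []
    for tgt in monolingual_target:
        src = translate_back(tgt)
        if src != "<unk>":            # filter out translation failures
            synthetic.append((src, tgt))
    return synthetic

pairs = back_translate(["Good morning", "Thank you"])
# Each pair: (synthetic German source, real English target).
```

The key property is that the target side, the side the forward model learns to generate, is always real human text; only the input side is synthetic.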
Okay, terminology. You would use terminology augmentation, for example, in an enterprise setting or a government setting, when you have specific terms that you want the system to translate a certain way. Now, a deterministic way to do that would be to just use a regular expression, go in and do word substitution, string substitution, and create data that uses those terms. And that is an effective way to proceed. There's another way to proceed these days, which is to take your training data and augment it with a special token. Here we have some markup that says term set, and then you just append the terminology that you want the system to use to the end of the training data, and train the system on that. So this is a softer method of data augmentation than just going in and doing string substitution, and it turns out to work really great. So this shows results for a baseline system on a test set in blue, and then using this type of terminology augmentation.
So the BLEU score is on the left; that goes up a little bit. Changing specific terms typically doesn't help you much on an overall translation quality score, even at the sentence level. Getting a couple more words right, like somebody's name or a product name, may not make much of a difference over a test set of a thousand sentences, but for that particular sentence, in a specific use case, it may make a big difference to the user. If you look at term recall, though, if you specialize the question to "does the system use the terms that you want it to use?", it just makes a huge difference. You can see close to a 10% gain in some cases, and a larger gain on a user data set that we use. This type of augmentation makes a really big difference.
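The "soft" terminology augmentation described above can be sketched like this. The exact separator token and hint format are illustrative assumptions, not the actual markup Lilt uses, but the pattern of appending a term set to the source sentence is the same:

```python
def add_term_annotations(source, term_pairs, sep_token="<term_set>"):
    """Append desired (source term -> target term) hints to the source
    sentence after a special separator token. The model is trained on
    (annotated source, target) pairs so it learns to copy the requested
    terminology into its output."""
    hints = " ".join(f"{s} = {t}" for s, t in term_pairs)
    return f"{source} {sep_token} {hints}"

# Toy usage with a made-up English->German term constraint.
src = "Install the widget on the panel."
terms = [("widget", "Steuerelement")]
annotated = add_term_annotations(src, terms)
```

Because the hint is just extra input rather than a hard string substitution, the model can still inflect or reorder the term to fit the target sentence grammatically, which is what makes this softer than a regex replacement.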
So I'll end by talking about a system that we build in our translation solution. The way that translation typically works, it's almost like a newsroom: you have somebody who creates an original translation, and then you have a reviewer who comes along behind that person, reads it, corrects any mistakes they find, and makes any stylistic changes that they think are preferable. And typically the reviewer is a more senior and more skilled translator than the person who does the first pass. And so in this case, the way you can think about this is: say we're translating from English to German. You take an English input, the translator translates the document into German, the reviewer gets both the English document and the German translation, and they make changes to the German translation. So you get an output that's sort of a corrected German translation. Now, this looks a lot like a translation problem, only instead of one input you've got two: the English input and the German input.
And the task is to generate a corrected German output. And this is exactly the way to cast this as a machine learning problem that you can solve with a transformer network. And we had a paper out about this last summer and it works really great. And so this is really interesting because a lot of the changes that a reviewer makes, again, this is like the newsroom editor. So this is the person that's doing the final proofreading before something is published. A lot of those changes are stylistic. They incorporate world knowledge, they're incorporating information that may not be present in the text. And if you can somehow model that, then you can make this whole process more efficient.
You can make the whole workflow more efficient from end to end. And so this is sort of a schematic for how it works. You have the source sentence and an original translation that the translator generated, and what the system will do is take those two inputs and generate a corrected output. And again, all the details of how this works are in the paper that we published last year at NAACL, but it is just a transformer network. So it's the same sort of machine learning setup that we talked about at the beginning of this talk.
And this works really great. We also do some synthetic data generation for training this system, and the way that we do it is with perturbations. So we have a series of rules where we insert misspelled words, drop words, put words in the wrong order, and repeat words. These mimic the types of mistakes that people make. For example, if you're translating something and you skip a word, you just didn't notice that word was in the source, you can create training data that drops words, and then the job of the system is to predict the words that were dropped. You can do this at the word level, and you can also do it at the character level to simulate spelling errors, for example where someone just hit the wrong key on the keyboard. You can apply these transformations to the data and then train the system on that.
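The perturbation rules described above can be sketched as a handful of small functions. These are toy, illustrative versions; the real system applies such noise at scale, and each (perturbed, original) pair becomes a training example where the original is the correction target:

```python
import random

def drop_word(tokens, rng):
    """Simulate a skipped word."""
    if len(tokens) < 2:
        return tokens[:]
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]

def swap_adjacent(tokens, rng):
    """Simulate wrong word order by swapping two neighboring words."""
    if len(tokens) < 2:
        return tokens[:]
    i = rng.randrange(len(tokens) - 1)
    out = tokens[:]
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

def repeat_word(tokens, rng):
    """Simulate an accidentally doubled word."""
    i = rng.randrange(len(tokens))
    return tokens[:i + 1] + [tokens[i]] + tokens[i + 1:]

def typo(word, rng):
    """Character-level noise: replace one character, like a mis-keyed letter."""
    i = rng.randrange(len(word))
    return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i + 1:]

def perturb(sentence, seed=0):
    """Apply one random word-level perturbation to a clean sentence."""
    rng = random.Random(seed)
    tokens = sentence.split()
    op = rng.choice([drop_word, swap_adjacent, repeat_word])
    return " ".join(op(tokens, rng))
```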
And this is actually key to getting this type of system to work. And it works great. So we published last year; there are some different approaches to doing this task, which is fundamentally a sort of bilingual grammar correction task, and there are some systems that came before ours. This is an F-score, which trades off between precision and recall, so higher is better. We got state-of-the-art performance relative to the systems that had come before, and key to that was using synthetic data.
So I appreciate the opportunity to give this talk. I think there's a bunch of super exciting work going on in language right now; these systems are getting better at a really rapid rate. Synthetic data is key to getting these systems to work, both for tasks where we have a lot of training data, but more specifically for tasks that are low-resource. And I think those applications hold in the commercial enterprise for sure, but especially in the government, where you have low-resource language settings and where we do a lot of our work training and deploying machine learning systems. I'd be happy to take your questions about either one of those. Thank you.