Exploring NLP Part 1: Why Should a Privacy Engineering Company Care About NLP?
Large language models present immense opportunities for synthetic text generation. Researchers and practitioners can use these pre-trained models to generate near-human-level essays and answers to complex questions. We at Gretel are actively looking at innovations in NLP so that we can learn more about synthetic text, but we are doing so through a critical lens. Researchers are starting to question whether the advances in large language models are worth the risks. These models tend to pick up on systemic societal biases present in their training data. Moreover, training these models has huge financial and environmental costs. Although these models could offer Gretel higher quality synthetic text, we want to acknowledge the ethical challenges and use these models responsibly (if we choose to use them at all).
Over the past decade, natural language processing has undergone a revolution. Practitioners can now choose from a variety of models, ranging from word2vec to BERT and everything in between. Multiple libraries, such as spaCy, Hugging Face, Gensim, and Flair (just to name a few!), have robust support for advanced language models.
In this blog series, we will explore:
- The ethical concerns around ever-growing language models and why Gretel cares about NLP
- A novel metric we propose for evaluating synthetic text quality that integrates work from multiple researchers
- Analysis of how various models perform based on our proposed metric
In these blog posts, we assume some familiarity with pre-trained models and fine-tuning. Click here for a crash course.
We at Gretel care as much about the quality of our generated synthetic data as we care about its privacy. When thinking about large language models, we want to make sure that the data being produced will not compromise the privacy of an individual. Carlini et al. discovered that out-of-distribution text in the training data, which they call “secrets”, can quite easily be memorized and replayed by the model. One benefit of pre-trained transformer models is precisely that they are pre-trained! Most of what the model has learned comes from its original training corpus, so even after fine-tuning, the model would most likely replay the original training data rather than the fine-tuning data, if it replays any data at all.
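To make the idea of "replaying" training data concrete, here is a minimal sketch of one way to flag it: measure how many word n-grams in a generated text also appear verbatim in the training records. This is an illustrative check only. It is not the exposure metric from Carlini et al., nor a description of how Gretel tests for memorization, and the function names and the default n-gram length of 5 are assumptions made for this example.

```python
from typing import Iterable, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def replayed_fraction(generated: str, training_records: Iterable[str], n: int = 5) -> float:
    """Fraction of the generated text's n-grams that occur verbatim in training data.

    Hypothetical helper: a value near 1.0 suggests the text is largely copied,
    a value near 0.0 suggests little verbatim overlap.
    """
    generated_ngrams = word_ngrams(generated, n)
    if not generated_ngrams:
        # Text shorter than n words has no n-grams to compare.
        return 0.0
    training_ngrams: Set[Tuple[str, ...]] = set()
    for record in training_records:
        training_ngrams |= word_ngrams(record, n)
    return len(generated_ngrams & training_ngrams) / len(generated_ngrams)


if __name__ == "__main__":
    training = ["my secret card number is 1234 5678 9012 3456"]
    sample = "The model said my secret card number is 1234 5678"
    print(replayed_fraction(sample, training))  # prints 0.5
```

A check like this only catches verbatim copying. Paraphrased or partially altered secrets would slip through, which is one reason memorization is harder to rule out than a simple overlap score suggests.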
But we want to do better than that. Recent research has shown that large language models have a multitude of ethical issues. Many models contain systemic societal biases, such as racism and sexism, that can affect the synthetic data that is generated. This is due to both the explicit and implicit biases that are ubiquitously present in books, the internet, and other sources of text. Because of this, we at Gretel have thought about curating our own datasets and pre-training our own language models. However, curating this type of dataset correctly is still an open research problem.
Even if we could create an unbiased dataset (which is quite a lofty assumption!), we’d then have to contend with the environmental implications of training our own language model. Gebru et al. reported that one training run of a BERT base model (approximately 110 million parameters) produced as much CO2 as a trans-American flight. Hugging Face showed that even when researchers are trying to be environmentally conscious with a 200 billion parameter language model, the experiments and training of such a model can produce even more CO2 than a car emits over its lifetime. Until renewable energy sources are ubiquitous across cloud compute providers and training times drop drastically, the cost of pursuing a possibly unbiased model may be too high.
None of this is to say that we shouldn’t try to use large language models. One of our goals at Gretel is to democratize data. By providing an easy and seamless way for our users to generate text from large language models, small companies, nonprofits, and even individuals may have better access to these innovations. If we use a pre-trained model in our backend, we can reuse its previously trained parameters. Thus, with a smaller dataset for fine-tuning, or with as little as a few examples for few-shot learning (or even none for zero-shot!), many of these pre-trained language models can provide state-of-the-art synthetic text at a much lower cost to our users.
But we first want to know what the best solution is. So we are asking questions like:
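As a rough illustration of what few-shot and zero-shot use can look like in practice, the sketch below assembles a text prompt from a task description and a handful of example records, leaving the last slot open for a pre-trained model to complete. It assumes the in-context prompting style of few-shot learning, where no parameters are updated. The function name `build_prompt` and the "Example N:" layout are hypothetical choices for this post, not part of any library or of Gretel's product.

```python
from typing import Sequence


def build_prompt(task: str, examples: Sequence[str], prefix: str = "Example") -> str:
    """Build a completion prompt from a task description and example records.

    With an empty `examples` sequence this is a zero-shot prompt; with a few
    examples it is a few-shot prompt. The final, unfinished line is the slot
    the language model is expected to fill in.
    """
    lines = [task.strip(), ""]
    for index, example in enumerate(examples, start=1):
        lines.append(f"{prefix} {index}: {example.strip()}")
    # Open slot for the model's own generation.
    lines.append(f"{prefix} {len(examples) + 1}:")
    return "\n".join(lines)


if __name__ == "__main__":
    few_shot = build_prompt(
        "Write a short product review.",
        ["Great battery life.", "Arrived late but works."],
    )
    print(few_shot)
    print("---")
    # Zero-shot: the task description alone, with no examples.
    print(build_prompt("Write a short product review.", []))
```

The resulting string would then be passed to whichever pre-trained model is in use. Because the examples live in the prompt rather than in the model's weights, no training run is needed, which is where the cost savings come from.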
- How do we know whether we have generated high quality synthetic text?
- Which models provide us with the highest quality data, while also not taking too long or disrupting our users’ workflow?
- Do the ethical issues these models come with outweigh their utility now that they are here to stay?
In the next blog post of this series, we will explore the ideas behind text metrics and propose a new metric inspired by current research in the field of NLP. By defining what we mean by quality synthetic text, we’ll be one step closer to answering the above questions.