Fine-tune an MPT-7B LLM with Gretel GPT

Learn how to fine-tune and prompt MPT-7B to generate responses matching popular Twitter personalities with Gretel GPT.

In today's blog post, we dive into fine-tuning a multi-billion-parameter large language model (LLM) using Gretel GPT. Our focus is on training the model with recent tweets from your favorite Twitter personalities, with the goal of having it answer questions in a manner consistent with each one. It's a fun and engaging experiment that also showcases two practical techniques: updating a large language model with new knowledge, such as recent tweets, and leveraging conditional data generation to create synthetic examples you might use to train a downstream ML model, such as a sentiment classifier or question-answering bot.

This post was inspired by Boris Dayma's HuggingTweets project, which can be found here: https://github.com/borisdayma/huggingtweets. Boris's project was a great starting point for our own work, and we're grateful for his contributions to the open source community.

In our post, we'll walk you through the steps of fine-tuning a large language model using the Gretel GPT API. We'll also provide some examples of how you can use the adapted model to answer questions and generate text that reflects the tone and personality of your favorite Twitter influencers.

We hope you enjoy this post!

Getting Started

First, you'll need a Twitter API key, or you can use our sample dataset. A Twitter API key is easy to sign up for, and they're free (for now ;-)). You will also need a Gretel API key to use our model training and inference services. Once you have both, run the notebook in Colab →

Disclaimer: This project is not to be used to publish any falsely generated information but to perform research on Natural Language Generation.

Once you have the necessary keys, the first step is to compile a list of Twitter users you want the model to emulate, along with a set of prompts for the model to complete for each personality. For this example, we'll curate a diverse group of well-known Twitter users, then prompt the model to complete a few open-ended sentences to see how each synthetic Twitter personality responds.

# We grab all available tweets (limited to 3200 per API limitations) based on Twitter handle.


# <--- Enter the screen name of the users you will download your dataset from. --->
HANDLES = ['elonmusk',
           'ylecun',
           'karpathy',
           'fchollet',
           'jeremyphoward',
           'jackclarkSF',
           'Shivon',
           'gretelai',
           'GoogleAI',
           'RickandMorty',
           'KingJames',
           'BillGates',
           'Christiano',
           'Iamcardib']


# <--- Enter a list of prompts to ask these Twitter personalities after model training completes --->
QUESTIONS = ["I think that",
             "I like",
             "I don't like",
             "I want",
             "My dream is"]

Our goal is to ensure that the model responds in a manner consistent with each user's personality, while avoiding excessive repetition or content overlap in the answers. Of course, feel free to experiment with your own data and train the model on a specific domain or group of users to observe how the model responds to various questions.

Let's now explore the code involved in this project:

First, we'll download up to the last 3,200 tweets from each user in our list (the Twitter API limit). To ensure high-quality training data, we'll preprocess the text by removing URLs, Twitter handles, and retweets, and dropping posts with fewer than three words.

Once we have the preprocessed tweets, we'll create a training set where each line starts with the Twitter handle name followed by the corresponding text. While it's possible to use separators between the handle and text, it's often unnecessary with larger language models. The desired format for each line is: [handle] [text].
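A minimal sketch of these preprocessing and formatting steps might look like the following (the exact filters and thresholds used in Gretel's notebook may differ):

```python
import re
from typing import Optional

def clean_tweet(text: str) -> Optional[str]:
    """Clean a raw tweet, returning None if it should be dropped."""
    # Drop retweets entirely.
    if text.startswith("RT @"):
        return None
    # Strip URLs and Twitter handles.
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    text = " ".join(text.split())  # normalize whitespace
    # Drop posts with fewer than three words.
    if len(text.split()) < 3:
        return None
    return text

def to_training_line(handle: str, text: str) -> str:
    """Format a training example as: [handle] [text]."""
    return f"{handle} {text}"
```

For example, `to_training_line("gretelai", "Synthetic data is awesome")` yields `"gretelai Synthetic data is awesome"`, which is exactly the format described above.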

To configure the training process, we'll define a Gretel training configuration in YAML. This configuration specifies the model name, tuning parameters, and project name, which will be used to store the adapted model in the Gretel service after training. Think of the project as a repository within Gretel, where you can store and share trained models with others.

We will use the following config:

config = {
  "schema_version": 1,
  "models": [
    {
      "gpt_x": {
        "data_source": "__",
        "pretrained_model": "gretelai/mpt-7b",
        "batch_size": 4,
        "epochs": 2,
        "weight_decay": 0.1,
        "warmup_steps": 100,
        "lr_scheduler": "cosine",
        "learning_rate": 0.0002,
        "validation": None
      }
    }
  ]
}

Once we have the training configuration ready, we can initiate model training. Training state-of-the-art language models with billions of parameters can be a resource-intensive task. However, Gretel employs techniques like parameter-efficient fine-tuning (PEFT) and Low-Rank Adaptation of Large Language Models (LoRA) to enable efficient training on more accessible GPUs, such as NVIDIA A10Gs or single A100s. These techniques optimize the fine-tuning process, improving performance while consuming less memory.
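To build intuition for why LoRA saves so much memory, here is a toy NumPy illustration of the core idea (purely illustrative, not Gretel's actual implementation): the pretrained weight matrix W is frozen, and only a low-rank update ΔW = BA is trained.

```python
import numpy as np

d, k, r = 512, 512, 8  # layer dimensions; rank r is much smaller than d and k
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))         # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))                    # trainable, initialized to zero

# Effective weight after fine-tuning. Since B starts at zero,
# the adapted model initially behaves exactly like the base model.
W_adapted = W + B @ A

full_params = d * k        # parameters to train without LoRA: 262,144
lora_params = r * (d + k)  # trainable parameters with LoRA: 8,192
```

Only A and B receive gradient updates, so the trainable parameter count drops by roughly a factor of 32 in this toy example, which is what makes fine-tuning feasible on a single commodity GPU.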

from gretel_client import configure_session
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project, get_project


PROJECT = "gretel-tweets"


project = create_or_get_unique_project(name=PROJECT)
model = project.create_model_obj(model_config=config)
model.data_source = train_dataset
model.name = f"{PROJECT}-gpt"
model.submit_cloud()


poll(model)

After training is complete, we can construct queries and send them to the Gretel API. Queries use the format [handle] [prompt], where the handle identifies the Twitter user and the prompt is the beginning of a sentence or question. For instance, the query "elonmusk My dream is" asks the model how it thinks the elonmusk Twitter account would finish that prompt ;-).

To facilitate sending the queries to Gretel, we create a record handler object and format the prompts as either a single-column CSV or a Pandas DataFrame. In this example, we'll use a dataframe.

import pandas as pd


prompts = pd.DataFrame([f"{h} {q}" for q in QUESTIONS for h in HANDLES], columns=["tweets"])


record_handler = model.create_record_handler_obj(
    params={"maximum_text_length": 100},
    data_source=prompts
)
record_handler.submit_cloud()
poll(record_handler)
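Once generation finishes, the record handler's output can be post-processed. Assuming the generated rows come back in the same [handle] [text] format used for training, a simple way to group completions by personality is a sketch like this (the function name and sample rows are illustrative):

```python
from collections import defaultdict

def group_by_handle(rows):
    """Split '[handle] [text]' rows and group completions by handle."""
    by_handle = defaultdict(list)
    for row in rows:
        # The handle is everything up to the first space.
        handle, _, text = row.partition(" ")
        by_handle[handle].append(text)
    return dict(by_handle)
```

For example, `group_by_handle(["elonmusk My dream is big"])` returns `{"elonmusk": ["My dream is big"]}`, making it easy to compare how each personality answered the same prompt.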

Examining the results

Let’s take a look at how our favorite Twitter personalities complete the prompt “My dream is…”

@karpathy My dream is to have a small robot that can be a companion for my kids in a few years time and then have it do all the cleaning and cooking so that I can do more fun things with them I love this idea of a companion robot I’m just not sure it will be small and cheap enough 😂🤔😆😁🤷‍♂️😂🤷‍♂️🤷‍♂
@jeremyphoward My dream is to have the world’s largest collection of datasets where each one is annotated and has been published in a paper so that everyone can build and share things with it I m not there yet but I’m working on it And thanks for the support 🙏🏽✨👍🏽👌🏽👌🏽💪🏽🤗😍🤩🥰👌🏽�
@elonmusk My dream is to build a car that will take you to Mars in 2 weeks at 1.6 Gs I also dream of a rocket that will take you to the Moon in 30 min at 1.6 Gs But I’m not sure they’re connected 😅 🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔🤔
@RickandMorty My dream is to play the harmonica with my favorite band I think we should do a show and we should play together I will bring the Rick Morty harmonica and I will bring the Rick and Morty harmonica and we should all play together and it would be a great show and we could record it and it would be a great album and we would all be so happy
@iamcardib My dream is to become a fashion designer 👠💃🏽👠💃🏽👠💃🏽👠💃🏽👠💃🏽👠💃🏽👠💃🏽👠💃🏽👠💃🏽👠💃🏽👠
@fchollet My dream is for the next decade to be a decade of a much more accessible and widely deployed AI industry a much broader AI industry that is more inclusive and has a much higher level of AI literacy in the population This is not going to be easy though and we re just getting started now ☀️🤖👽🧠🤖🧠👽🤖🧠🤖🧠👽🤖🧠

It’s great to see a single model distinguish between the personalities of each Twitter account and use that context to answer questions. Another neat result is the model picking up on recent, topical subject areas. As for areas to improve: the model seems to really love using emojis.

Some ideas to further improve on the results

  • Adding more Twitter users or adapting the model for users from a specific domain.
  • Prompting the model to generate results from multiple users.
  • Adjusting parameter settings, such as increasing the number of training steps.
  • Implementing a pipeline to periodically re-train the network with new tweets.

Acknowledgements