Generate textbook-quality synthetic data for training LLMs and SLMs

How to use Gretel's Tabular LLM for generating diverse, high-quality training data to create better language models.

In the notebook and video walkthrough below, we'll use Gretel's Tabular LLM to generate new, diverse, high-quality training examples for building better LLMs. Our goal with this notebook is to demonstrate how to get started creating high-quality synthetic data for LLM training, and to facilitate further research into safeguards for completion models.

Recent research has shown that training small, efficient language models (SLMs) on high-quality, diverse data can achieve state-of-the-art results, even rivaling or surpassing LLMs 5x their size, such as Llama2-7b and Falcon-7b, on common tasks. This has been demonstrated by models like Microsoft's phi-1.5 (from the paper "Textbooks Are All You Need II"), Orca 2, and IBM's Granite. Using similar techniques, we'll demonstrate ways to inject randomness into the prompt that give rise to diverse generated datasets.

Creating diverse training data is challenging but vital for reducing overfitting and improving generalization. We'll use techniques like including a random subset of words in each prompt, as done in TinyStories.
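As a minimal sketch of the random-word-subset idea, the snippet below builds a prompt that asks the model to work a few randomly sampled words into its output. The word list and prompt wording here are illustrative assumptions, not the exact vocabulary or template used by TinyStories or Gretel.

```python
import random

# Small illustrative word list; TinyStories sampled from a much larger
# vocabulary. These specific words are just examples for this sketch.
VOCAB = [
    "telescope", "harvest", "whisper", "compass", "lantern",
    "voyage", "ember", "meadow", "ripple", "summit",
]

def build_prompt(base_instruction, n_words=3, seed=None):
    """Inject a random subset of words into the prompt so that each
    generation call is steered toward different content."""
    rng = random.Random(seed)
    words = rng.sample(VOCAB, n_words)
    return (
        f"{base_instruction}\n"
        f"Make sure the text naturally uses all of these words: {', '.join(words)}."
    )

print(build_prompt("Write a short, textbook-style explanation of tides.", seed=7))
```

Because the sampled words change on every call, repeated generations with the same base instruction land on different topics and phrasings, which is exactly the diversity signal we want in the synthetic dataset.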

Compared to models trained on web data, “Textbooks Are All You Need II” highlights additional advantages from using textbook-like data: "the model seems to store and access the knowledge more efficiently" and it has an "attenuating effect on toxic content generation." However, as the authors note, "although phi-1.5 has a lower propensity for generating toxic content, [it] is not immune." They posit phi-1.5's reliance on synthetic data "provide[s] a useful platform for exploring these challenges further."

Before diving into the notebook, there are a couple of prerequisites:

  1. Gretel API Key: You'll need an API key from Gretel. If you don't have one already, you can obtain it from Gretel's console. This key will enable us to use Gretel's services for generating our synthetic datasets.
  2. Access to Gretel's Tabular LLM: To utilize the specific features of the Tabular LLM, you need to have access to the early preview. If you're not already signed up, you can request early access at Gretel's Tabular LLM page.
  3. Domain-specific training data: To try this approach with your own data, you'll need an LLM training dataset in a standard input/output format, like one you might load from HuggingFace or use to train your model. Or, get started quickly with the example below using the databricks/dolly-15k dataset.
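To illustrate the input/output format mentioned in step 3, here is a sketch that turns a dolly-15k-style record into a single training text. The `instruction`/`context`/`response` field names follow the Databricks dataset's schema; the `### Instruction:` template is an assumed Alpaca/Dolly-style convention, not a Gretel requirement, and the HuggingFace loading call is commented out to avoid a network dependency in this sketch.

```python
# To load the real dataset (requires network access and the `datasets` package):
# from datasets import load_dataset
# ds = load_dataset("databricks/databricks-dolly-15k", split="train")

# A hardcoded sample record in the same shape as a dolly-15k row.
example = {
    "instruction": "What is synthetic data?",
    "context": "",
    "response": "Synthetic data is artificially generated data that mimics real data.",
}

def to_training_text(record):
    """Format one record into a single instruction-tuning string,
    skipping the context section when it is empty."""
    parts = [f"### Instruction:\n{record['instruction']}"]
    if record.get("context"):
        parts.append(f"### Context:\n{record['context']}")
    parts.append(f"### Response:\n{record['response']}")
    return "\n\n".join(parts)

print(to_training_text(example))
```

Any dataset you can map into this kind of instruction/response structure should work with the approach in the notebook.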

Let's get started!

Here's the Colab notebook, and a video walkthrough to guide you.