Optimize the Llama-2 Model with Gretel’s Text SQS

How Gretel's data quality analysis tools for evaluating generated text can help you optimize the performance of LLMs, like the Llama-2 model.

In this blog, we’ll show how Gretel’s Text SQS can help evaluate the quality of generated text records as we fine-tune the Llama-2 model.

Fine-tuning LLMs and generating synthetic text data with them can be quite time-consuming. Evaluating the quality of the results manually and updating the parameters accordingly can make this cycle even slower. By automating the process of optimization and evaluation using Gretel’s Text SQS, developers can focus more on building and uncovering insights.

We first describe the text dataset and its components, then observe the text scores as the model is fine-tuned with different numbers of steps. Finally, we take a look at the generated records at each step and compare them with the text evaluation score to see how the score relates to a human evaluation. You can try the experiment with your own dataset and monitor the results using this notebook.

Our Training Dataset

SAMSum is a text summarization dataset in which each example contains a dialogue and its human-written summary in English. Text summarization has many use cases, including training chatbots for better question answering and analyzing literature, research, or legal contracts. In this experiment, we combine the dialogue and summary into a single column for each example, since Gretel’s gpt-x model expects each record to consist of a single free-text field. We use a sample of 1,000 records, each shorter than 100 tokens, to keep the model fine-tuning duration under 30 minutes.
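The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the field names `dialogue` and `summary`, the `Dialogue:`/`Summary:` labels, the helper names, and the whitespace-based token count are all assumptions made for the example. A real run would count tokens with the model's own tokenizer.

```python
import random

def combine_record(dialogue, summary):
    """Merge a dialogue and its summary into one free-text field."""
    return f"Dialogue: {dialogue.strip()}\nSummary: {summary.strip()}"

def approx_token_count(text):
    """Rough stand-in for a tokenizer (assumption): count whitespace-separated words."""
    return len(text.split())

def build_training_texts(records, max_tokens=100, sample_size=1000, seed=0):
    """Combine each record, drop long ones, and sample at most `sample_size`."""
    texts = [combine_record(r["dialogue"], r["summary"]) for r in records]
    short = [t for t in texts if approx_token_count(t) < max_tokens]
    if len(short) > sample_size:
        short = random.Random(seed).sample(short, sample_size)
    return short

# Hypothetical example records, not taken from SAMSum.
records = [
    {"dialogue": "A: Are you coming tonight?\nB: Yes, see you at 8.",
     "summary": "B will join A tonight at 8."},
    {"dialogue": "A: " + "word " * 200, "summary": "Too long to keep."},
]
training_texts = build_training_texts(records)
```

Here the second record is dropped because it exceeds the length limit, so only the first combined record remains.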

The following is a snippet of the text training dataset:

Figure 1. Dialogue snippets from our training dataset.

A Few Steps to Fine-Tune a Better Llama-2 Model

It is vital to generate text dialogues and summaries that are semantically similar to the data the model was trained on. To measure this, we use Gretel’s Text Synthetic Quality Score (SQS). This metric combines semantic and statistical similarity scores between the generated and training datasets. We monitor it as we fine-tune the Llama-2 model with the default configuration while changing the “steps” parameter.

As the following figure shows, gradually increasing the number of steps in the lower range improves learning, which results in a better Text SQS. Score values above 80 are considered “excellent.” Beyond a certain step value, the text quality stays almost the same, with no significant improvement.

Figure 2. Generated text quality improving with increasing steps.

Let’s take a look at the first 5 of the 80 generated records at a few step values:

The following snippet shows the generated text records when steps = 250 and Text SQS = 38. The records are a mix of NLP prompts, code snippets, non-English text, and stray punctuation, all completely unrelated to the dialogue-summary training dataset.

Figure 3. Generated text that has a low quality score of 38.

Let’s now look at the results when steps = 300. Some records still include unnecessary punctuation (words other than “dialogue” and “summary” are surrounded by “**”). Most of the records start with “dialogue” followed by “summary,” which shows the model is learning the format, although the summary is repeated more than once in some examples, so there is still room for improvement. Text SQS improves to 66.

Figure 4. Generated text with an above-average quality score of 66.

The following shows the generated records when steps = 700. “Dialogue” appears in all of the records, while “summary” appears in more than 90% of them. The “summary” is also a suitable representation of the “dialogue,” and no stray punctuation is observed. Text SQS improves to 80.

Figure 5. Generated text with a high quality score of 80.

With steps set to 1500, the generated text is much the same as in the previous run, and so is the Text SQS (82). “Summary” is still present in more than 90% of the generated records:

Figure 6. Generated text with a quality score of 82.
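The four runs above suggest a simple rule for picking a step count: take the smallest value whose score is close to the best one observed. The sketch below uses the Text SQS values reported in this post. The function name and the 2-point tolerance are assumptions chosen for illustration, not part of Gretel's tooling.

```python
# Text SQS values reported in this post, keyed by the "steps" setting.
sqs_by_steps = {250: 38, 300: 66, 700: 80, 1500: 82}

def smallest_sufficient_steps(scores, tolerance=2):
    """Return the fewest steps whose score is within `tolerance` of the best score."""
    best = max(scores.values())
    return min(steps for steps, score in scores.items() if best - score <= tolerance)

# With a 2-point tolerance (an assumed threshold), 700 steps is enough.
chosen_steps = smallest_sufficient_steps(sqs_by_steps)
print(chosen_steps)  # 700
```

A tighter tolerance favors longer training; a looser one trades a few points of score for a shorter fine-tuning run.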

Note: For the last two runs, we can always add an extra evaluation step that keeps only the generated records that include both “dialogue” and “summary.” Also, because the model was trained on records shorter than 100 tokens, some long generated records end in incomplete sentences, mostly at smaller “steps” values.
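The extra evaluation step mentioned in the note could look something like this. It is a minimal sketch under stated assumptions: the helper names are hypothetical, the match is a case-insensitive substring check, and the sample records are invented for the example.

```python
def has_dialogue_and_summary(text):
    """True if the record mentions both labels (case-insensitive substring match)."""
    lowered = text.lower()
    return "dialogue" in lowered and "summary" in lowered

def keep_complete_records(generated):
    """Keep only the generated records that contain both a dialogue and a summary."""
    return [text for text in generated if has_dialogue_and_summary(text)]

# Hypothetical generated records, not actual model output.
generated = [
    "Dialogue: A: Lunch? B: Sure. Summary: A and B will have lunch.",
    "Dialogue: A: Did you send the file? B: Not yet.",
    "def foo(): pass",
]
complete = keep_complete_records(generated)
```

Only the first record survives the filter here. A stricter check could also require that “dialogue” comes before “summary.”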


Looking at the above results, we can conclude that Gretel’s synthetic text evaluation score tracks the quality of the generated records closely and agrees with our manual inspection. Also, after a certain number of fine-tuning steps, the text evaluation score and the quality of the generated records remain the same, so we set a default constant number of steps in our gpt-x configuration. Try the experiment with your own dataset and monitor the results using this notebook. If you have any questions, reach out to us in our community Discord channel. Happy synthesizing!