Evaluating Synthetic Math Datasets with Inter-Model Variability
Introduction
Imagine that you’re a teacher trying to write a tough (yet fair) homework assignment to help your students master the material. After grading, you compare each student’s answers to their grade in the class so far.
Consider four different possible patterns you might see:
1. All students performed equally well on the assignment, with few or no mistakes.
2. The students performing the best in the class tended to make fewer mistakes than the students with the lowest grades.
3. All students performed equally poorly on the assignment, with answers that rarely or never matched the answer key.
4. Answers differed from the answer key at random, with no relation to each student’s performance in the class so far.
Which scenario would give you the highest confidence that the homework assignment is both fair and reasonably difficult?
Evaluating synthetic data
Validation and evaluation are essential to generating high-quality synthetic data. At Gretel, we combine rule-based tools – like SymPy for math problems, sqlvalidator for SQL data, and pylint for Python code – with ML-based approaches like LLM-as-a-judge. By layering these approaches on top of the initial data generation, we can achieve much higher synthetic data quality.
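For the math case, the rule-based layer can be as simple as checking that a generated answer is symbolically equivalent to the reference answer. Here is a minimal sketch using SymPy; the function name and usage are illustrative, not a description of Gretel's actual validators.

```python
# Minimal sketch of a rule-based math check with SymPy.
# Illustrative only: not Gretel's actual validation pipeline.
from sympy import sympify

def answers_match(expected: str, generated: str) -> bool:
    """Return True if the two answer expressions are symbolically equivalent."""
    try:
        # .equals() tests numeric/symbolic equivalence (e.g., "3/4" vs "0.75").
        return bool(sympify(expected).equals(sympify(generated)))
    except (TypeError, ValueError):
        # Unparseable answers are treated as failures and flagged for review.
        return False

print(answers_match("3/4", "0.75"))          # True
print(answers_match("2*x + 2", "2*(x + 1)")) # True
print(answers_match("7", "8"))               # False
```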
If our primary goal for a synthetic dataset is to improve a downstream LLM in a specific area, the dataset's ability to deliver that improvement is obviously the most critical validation. We can check this by training a model on the synthetic dataset and measuring downstream improvement, but doing so online, during synthetic dataset creation, is time-consuming and risks overfitting to the downstream benchmark. Here, we instead investigate whether we can measure the correlation between performance on the synthetic dataset and performance on a downstream benchmark before any training takes place.
Think back to the teacher. The students’ grades in the class so far give a reference point for expected performance on the homework assignment. Based on this, scenario (2) is the most aligned with expectations: the students with the best grades made the fewest mistakes. The first scenario, on the other hand, suggests that the assignment is too easy, while the third and fourth scenarios suggest that the homework is either too difficult or that there are errors in the answer key.
Similar to the teacher example, we assess the correlation between performance on a “homework assignment” (a synthetic dataset) and a downstream task, measured against a gold-standard benchmark. Here, we’ll show how examining this correlation can provide a useful sanity check on the alignment between our synthetic training dataset and the downstream task.
Example: GSM8K-Synthetic Alignment
We’ll focus on our GSM8K-Synthetic Dataset, where the downstream task is the ability to solve math problems on the real GSM8K benchmark. For ease of experimentation, we used a 10% random subset of the full GSM8K-Synthetic dataset (1000 question-answer pairs).
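That subsetting step is straightforward; here is a rough sketch using the Hugging Face `datasets` library. The file name is a placeholder for wherever the GSM8K-Synthetic records live, not the dataset's actual location.

```python
# Illustrative subsetting step: draw a fixed random 10% subset (~1000 pairs).
# "gsm8k_synthetic.jsonl" is a placeholder path, not the real dataset location.
from datasets import load_dataset

full = load_dataset("json", data_files="gsm8k_synthetic.jsonl", split="train")
subset = full.shuffle(seed=42).select(range(1000))
```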
We began by selecting 20 small (<10B-parameter) open-weight base models. Each model was loaded with 8-bit quantization, which imposes a slight but consistent performance handicap compared to the unquantized versions.
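As a sketch, one of these models might be loaded in 8-bit with `transformers` and `bitsandbytes` as below. The model id is just one of the models discussed later, and this is not necessarily the exact loading code used in the experiments.

```python
# Sketch of loading a small open-weight base model with 8-bit quantization.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/phi-2"  # example: one of the ~20 small base models
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
model.eval()
```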
We then evaluated each model’s performance on both the synthetic and downstream tasks. On the synthetic task, we used a 1-shot prompt containing a question-answer pair from GSM8K-Synthetic.
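Since the exact prompt isn't reproduced here, the snippet below only illustrates the general shape of such a 1-shot prompt: one worked question-answer pair followed by the new question whose reference answer we score. The template and placeholder strings are hypothetical.

```python
# Illustrative shape of a 1-shot prompt (not the exact wording used).
ONE_SHOT_TEMPLATE = """Question: {example_question}
Answer: {example_answer}

Question: {new_question}
Answer:"""

example = {
    "question": "<worked GSM8K-Synthetic question>",
    "answer": "<its step-by-step answer>",
}
target_question = "<new question whose reference answer we will score>"

prompt = ONE_SHOT_TEMPLATE.format(
    example_question=example["question"],
    example_answer=example["answer"],
    new_question=target_question,
)
```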
Given the one-shot example and a new question, we measured the log-likelihood of the associated answer from GSM8K-Synthetic. Note that log-likelihood (and perplexity) is typically normalized by the number of tokens; here, however, we’re comparing models with different tokenizers, so that normalization isn’t consistent or fair. To avoid issues arising from tokenization, we measured the total (summed) log-likelihood across all 1000 answers for each of the 20 models. The performance of each model is shown in the plot below.
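For concreteness, the scoring step looks roughly like the sketch below. It assumes `model` and `tokenizer` are loaded as in the earlier snippet and that the prompt's tokenization is a prefix of the prompt-plus-answer tokenization; it is a simplified illustration, not the exact evaluation harness we used.

```python
# Sketch: summed log-likelihood of a reference answer, conditioned on its
# 1-shot prompt. Assumes `model` and `tokenizer` from the loading sketch above.
import torch
import torch.nn.functional as F

@torch.no_grad()
def answer_log_likelihood(prompt: str, answer: str) -> float:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids.to(model.device)
    n_prompt = prompt_ids.shape[1]

    logits = model(full_ids).logits                        # (1, seq_len, vocab)
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)   # position t predicts token t+1
    targets = full_ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Keep only the answer tokens; assumes the prompt's tokenization is a
    # prefix of the prompt+answer tokenization.
    return token_ll[:, n_prompt - 1:].sum().item()

# Per model, the total is then summed over all 1000 items, e.g.:
# total_ll = sum(answer_log_likelihood(p, a) for p, a in prompt_answer_pairs)
```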
As expected, there’s a strong correlation between a model’s log-likelihood on GSM8K-Synthetic and its downstream accuracy on the actual GSM8K benchmark. Interestingly, the relationship between GSM8K accuracy and GSM8K-Synthetic log-likelihood appears to be logarithmic; in other words, there’s a power-law relationship between GSM8K-Synthetic perplexity and GSM8K performance. On the other hand, performance on both tasks isn’t necessarily predicted by model size; for instance, phi-2 performs well on both tasks despite having only 2.7B parameters, while falcon-7b performs poorly despite being a 7B-parameter model.
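One quick way to check the power-law reading is a least-squares fit in log-log space, since a power law accuracy ≈ a · perplexity^(-b) is a straight line there. The helper below is a sketch that takes per-model arrays of perplexities and accuracies; it is an illustration, not the procedure behind the plot.

```python
# Sketch: fit accuracy ≈ a * perplexity**(-b) by least squares in log-log space.
import numpy as np

def fit_power_law(perplexity, accuracy):
    """Return (a, b) for accuracy ≈ a * perplexity**(-b), given per-model arrays."""
    slope, intercept = np.polyfit(np.log(perplexity), np.log(accuracy), 1)
    return np.exp(intercept), -slope
```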
Of course, GSM8K-Synthetic and GSM8K might both simply be correlated with a third variable, like overall model capability. To check whether there’s a more specific relationship, we look at the correlation between GSM8K-Synthetic log-likelihood and multiple benchmarks below. To avoid making any assumptions about the shape of these relationships, we use the non-parametric Kendall tau correlation; asterisks denote p-values corrected for eight comparisons.
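A sketch of that calculation is below, taking per-model arrays of GSM8K-Synthetic log-likelihoods and one benchmark's scores. Bonferroni is shown here as one common correction for multiple comparisons; the exact correction method is an assumption of this sketch.

```python
# Sketch: Kendall's tau between per-model synthetic log-likelihoods and a
# benchmark's scores, with a Bonferroni correction for eight comparisons
# (Bonferroni is an assumption; other corrections are possible).
from scipy.stats import kendalltau

def corrected_kendall(synthetic_ll, benchmark_scores, n_comparisons=8):
    tau, p = kendalltau(synthetic_ll, benchmark_scores)
    return tau, min(p * n_comparisons, 1.0)
```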
Although there’s a positive correlation between GSM8K-Synthetic and multiple benchmarks, the strongest correlation is with downstream GSM8K performance, followed closely by MMLU. This gives us more confidence that GSM8K-Synthetic is tapping into the same math reasoning capabilities required to do well on GSM8K. The high correlation with MMLU is also interesting; notably, MMLU and GSM8K both tend to be strongly correlated with underlying LLM general and abstract reasoning capabilities (Ruan et al., 2024).
Discussion
In this post, we measured the correlation in performance between an existing Gretel synthetic dataset (GSM8K-Synthetic) and the real GSM8K benchmark, showing that (a) there is a strong, logarithmic relationship between the two and (b) the correlation is stronger for GSM8K than for a range of other leading benchmarks. This approach can be used to sanity-check whether a synthetic dataset engages the same set of skills that we aim to improve in a student model.
Of course, there’s no guarantee that training on data that correlates with math performance will actually improve math performance in our student model. In future work, we’ll explore whether these signals can be used at the example level, or even token level, to identify issues in the synthetic dataset and ultimately improve the quality of the trained model.
If you have questions or comments about this research, join us and other developers, engineers, data scientists, and privacy fans in the Synthetic Data Community on Discord.
_____
1 A similar idea is explored in Lin, Zhenghao, et al. "Not All Tokens Are What You Need for Pretraining." (NeurIPS 2024). However, Lin et al. compare the same model before and after training on a high-quality reference dataset, while we take advantage of the inherent variability across pre-existing open-weight models.
2 Ruan, Yangjun, Chris J. Maddison, and Tatsunori Hashimoto. "Observational Scaling Laws and the Predictability of Language Model Performance." arXiv preprint arXiv:2405.10938 (2024).