GSM-Symbolic: Analyzing LLM Limitations in Mathematical Reasoning and Potential Solutions

What The Recent Paper on LLM Reasoning Got Right—And What It Missed.

Introduction

Large language models (LLMs) have made significant strides in various domains, including mathematical problem-solving. However, a recent paper titled "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models" by Mirzadeh et al. raises important questions about the true capabilities of these models when it comes to mathematical reasoning. We have reviewed the paper and consider it a valuable contribution to the ongoing discussion about AI capabilities and limitations. At the same time, our analysis suggests that its conclusions may not fully capture the complexity of the issue.

In this blog post, we'll examine the key findings of the GSM-Symbolic study, offer a critical perspective on its interpretations, and explore how synthetic data might address the challenges identified. 

The GSM-Symbolic Benchmark

The authors introduce GSM-Symbolic, an enhanced benchmark derived from the popular GSM8K dataset. This new benchmark allows for the generation of diverse question variants, enabling a more nuanced evaluation of LLMs' performance across various setups. The study's large-scale analysis of 25 state-of-the-art open and closed models provides significant insights into how these models behave when faced with mathematical reasoning tasks.

Figure 1: Illustration of the GSM-Symbolic template creation process.
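To give a rough sense of how such a template works, here is a minimal sketch. The template, name list, and value ranges are invented for illustration; they are not the paper's actual templates, which also enforce richer constraints on the sampled values.

```python
import random

# Illustrative symbolic template: proper names and numeric values become variables.
TEMPLATE = (
    "{name} has {x} apples. {name} buys {y} more apples and then gives "
    "{z} apples to a friend. How many apples does {name} have now?"
)
NAMES = ["Sophia", "Liam", "Ava"]

def instantiate(seed: int) -> tuple[str, int]:
    """Sample one concrete question/answer pair from the template."""
    rng = random.Random(seed)
    x, y = rng.randint(5, 50), rng.randint(5, 50)
    z = rng.randint(1, x + y)      # constraint keeps the final answer non-negative
    name = rng.choice(NAMES)
    question = TEMPLATE.format(name=name, x=x, y=y, z=z)
    answer = x + y - z             # ground truth follows directly from the template
    return question, answer

# Fifty instantiations of the "same" question, each with different surface details.
variants = [instantiate(seed) for seed in range(50)]
```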

Performance Variability and Model Comparisons

One of the most surprising findings is the high variability in model performance across different instantiations of the same question. All models exhibit "significant variability in accuracy" when tested on GSM-Symbolic. This variability raises concerns about the reliability of currently reported metrics on GSM8K, which typically rely on single-point accuracy measurements.

Figure 2: The distribution of 8-shot Chain-of-Thought (CoT) performance across 50 sets generated from GSM-Symbolic templates shows significant variability in accuracy among all state-of-the-art models.
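One practical takeaway is to report accuracy as a distribution over many generated variants rather than as a single number. The sketch below is a minimal, hypothetical evaluation loop; the `model` callable and the toy question sets are placeholders, not an actual harness.

```python
import statistics

def accuracy(preds, golds):
    """Exact-match accuracy on one instantiation of the benchmark."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def score_across_sets(model, question_sets):
    """Mean and standard deviation of accuracy over many generated variants,
    rather than a single-point number on one fixed test set."""
    scores = [
        accuracy([model(q) for q, _ in qset], [a for _, a in qset])
        for qset in question_sets
    ]
    return statistics.mean(scores), statistics.stdev(scores)

# Toy usage with a stand-in "model" that always answers 0; in practice `model`
# would wrap an LLM call and `question_sets` would be GSM-Symbolic instantiations.
question_sets = [[("2 + 2?", 4), ("3 - 3?", 0)] for _ in range(50)]
mean, std = score_across_sets(lambda q: 0, question_sets)
print(f"accuracy = {mean:.2f} ± {std:.2f}")
```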

What’s remarkable is that not all models are created equal. Llama-3-8b and GPT-4o are clear outliers in that they don’t exhibit as significant a drop on the new benchmark as other models like gemma-2-9b, phi-3, phi-3.5 and mathstral-7b. This observation suggests two important points: 

  1. Llama-3-8b and GPT-4o generally demonstrate a more robust understanding of mathematical concepts, although they are still not immune to performance variations.
  2. The training data for Llama-3-8b and GPT-4o likely has not been contaminated (or at least not to the same extent) with GSM8K data. In this context, data contamination refers to the unintentional inclusion of test or benchmark data in a model's training set, leading to artificially inflated model performance during evaluation. If contamination had occurred, as the authors hypothesize for some models, we would expect to see very high performance on GSM8K but significantly lower performance on even slight variations of these problems. 

These findings highlight a crucial opportunity for improvement through the use of synthetic data. Properly designed synthetic datasets can address both of these points for anyone training models:

  1. To mitigate potential data contamination issues, there's no need to use the original GSM8K data in training when high-quality synthetic versions can be generated (blog link). These synthetic datasets retain the mathematical reasoning challenges of GSM8K without reusing the exact problems or solutions, thus preserving the integrity of the model's evaluation; a simple way to verify this is sketched after this list.
  2. Even more importantly, it's possible to generate synthetic data that surpasses the quality of both the OpenAI GSM8K and Apple GSM-Symbolic datasets. This approach can lead to a more robust understanding of mathematical concepts, addressing the performance variability observed in current models.
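As a concrete example of the first point, a lightweight decontamination check can flag synthetic problems that overlap verbatim with benchmark items. The sketch below uses simple word-level n-gram overlap; it is a heuristic for illustration (the threshold and n are arbitrary), not Gretel's actual pipeline.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams, a cheap proxy for near-verbatim overlap."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def max_overlap(candidate: str, reference_corpus: list, n: int = 8) -> float:
    """Highest fraction of the candidate's n-grams found in any single reference item."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    return max(len(cand & ngrams(ref, n)) / len(cand) for ref in reference_corpus)

# Stand-in examples; in practice the reference corpus would be the GSM8K test set.
synthetic_problems = ["Ava has 12 apples and buys 7 more apples ..."]
gsm8k_test = ["Natalia sold clips to 48 of her friends in April ..."]
flagged = [p for p in synthetic_problems if max_overlap(p, gsm8k_test) > 0.5]
print(flagged)  # anything flagged here should be dropped or regenerated
```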

By leveraging advanced synthetic data generation techniques, we can push the boundaries of what AI models can achieve in mathematical reasoning tasks, potentially closing the gap between the top-performing models and the rest of the field.

Sensitivity to Changes and Complexity

The authors demonstrate that LLMs are more sensitive to changes in numerical values than to changes in proper names within problems. This sensitivity suggests that the models' understanding of the underlying mathematical concepts may not be as robust as previously thought. As the complexity of questions increases (measured by the number of clauses), the performance of all models degrades, and the variance in their performance increases. This highlights the importance of training on diverse data, which is exactly where synthetic data can help. There is no fundamental reason a model should perform worse given a simple change in numbers or a slight variation in the number of clauses.

Notably, the authors show in the Appendix that fine-tuning Phi-3.5 on only a single class of more difficult problems (GSM-P1) results in a performance improvement within that class, but does not improve performance within a different class (GSM-P2). This limited effectiveness of “narrow” fine-tuning aligns with our experience at Gretel and with our focus on the diversity of synthetic training examples.

Figure 3: The impact of increasing the number of clauses on performance.

The GSM-NoOp Challenge

Perhaps the most concerning finding is the introduction of GSM-NoOp, a dataset designed to challenge the reasoning capabilities of LLMs. By adding seemingly relevant but ultimately inconsequential information to problems, the authors observed substantial performance drops across all models – up to 65% for some. The authors propose that this points to current LLMs relying more on a type of pattern matching than true logical reasoning.

Figure 4: The performance of models drops significantly on GSM-NoOp, with more recent models experiencing a greater decline than older ones.
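To make the construction concrete, here is a toy sketch of a GSM-NoOp-style perturbation, loosely paraphrasing the kiwi example discussed in the paper. The exact wording is invented for illustration; the paper's distractor clauses are crafted per problem rather than generated this way.

```python
BASE = (
    "Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
    "On Sunday he picks double the number he picked on Friday. "
    "How many kiwis does Oliver have?"
)

# Sounds relevant (it is about the kiwis) but does not change the arithmetic.
NO_OP = "Five of the kiwis picked on Sunday were a bit smaller than average."

def insert_noop(question: str, distractor: str) -> str:
    """Place the inconsequential clause just before the final question sentence."""
    body, final_question = question.rsplit(". ", 1)
    return f"{body}. {distractor} {final_question}"

print(insert_noop(BASE, NO_OP))
# The correct answer is still 44 + 58 + 2 * 44 = 190; a model that subtracts the
# five "smaller" kiwis has been led astray by the irrelevant detail.
```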

A Critical Perspective on the Paper's Conclusions

While the GSM-Symbolic study provides valuable insights into the performance of LLMs on mathematical reasoning tasks, it's important to critically examine the paper's conclusions. The authors argue that the observed limitations suggest LLMs are not capable of true logical reasoning. However, this interpretation may be oversimplifying a complex issue.

The paper's argument for LLMs relying on pattern matching rather than reasoning seems less definitive when examined closely. It's clear that these models are not perfect reasoners - if they were, they would achieve 100% accuracy on GSM8K. But the leap from imperfect performance to a lack of reasoning capability is not necessarily justified.

There are at least two potential explanations for why LLMs, like humans, sometimes get questions wrong:

  1. The model tries to strictly pattern match a problem to something it has seen before, and fails if it can't.
  2. The model tries to follow a logical process but faces a (compounding) probability of making an error at each step.

The paper seems to lean towards explanation (1), but doesn't make a convincing case for why this should be preferred over explanation (2). In fact, (2) is more akin to human-like reasoning and potentially more interesting from a research perspective. 

Let's examine each main finding of the paper through this critical lens:

GSM-Symbolic Performance

The GSM-Symbolic approach is a valuable method for dataset expansion, validating the potential of synthetic data generation techniques like those used by Gretel. However, it's worth noting that model performance doesn't completely degrade on these new variants - the drop varies between 0.3% and 9%. If the models were strictly pattern matching, we might expect performance to drop to near zero on these new variants, since the model has never seen the randomly generated variants during training. The observed behavior seems more consistent with a model that can generalize to some degree, but occasionally makes errors, with errors being less likely when it has seen the exact problem before due to leakage.

Varying Difficulty

The paper's findings on performance degradation with increasing question complexity are consistent with the idea of compounding errors in a multi-step reasoning process. As the number of steps increases, on average, so does the probability of making an error at some point in the chain. 
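To make the compounding-error explanation concrete, here is a minimal sketch. The 5% per-step error rate is an arbitrary assumption chosen only to illustrate the trend: if a model slips independently at each reasoning step with probability p, the chance of a fully correct k-step solution is (1 - p)^k, which falls steadily as steps are added.

```python
# Hypothetical per-step error rate; chosen only to illustrate the trend.
P_STEP_ERROR = 0.05

def p_fully_correct(p_step_error: float, num_steps: int) -> float:
    """Probability of completing all steps without a single slip,
    assuming independent errors at each step."""
    return (1 - p_step_error) ** num_steps

for k in (2, 4, 6, 8):
    print(f"{k} steps -> {p_fully_correct(P_STEP_ERROR, k):.2f}")
# Roughly 0.90, 0.81, 0.74, 0.66: accuracy declines as steps/clauses are added,
# even though the underlying "reasoning procedure" never changes.
```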

This behavior is observed in human problem-solving as well. For instance, on a subset of the challenging MATH benchmark, even several former math olympians only scored 18/20 due to small arithmetic errors [1]. This suggests that error-prone reasoning, rather than a total lack of reasoning capability, might be a more accurate description of both human and LLM performance.

[1] Hendrycks, Dan, et al. "Measuring Mathematical Problem Solving With the MATH Dataset." Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).

GSM-NoOp Challenge

The GSM-NoOp results, while interesting, may not be as directly related to reasoning capability as the paper suggests. In real-world scenarios, we typically assume that all information provided in a problem statement is relevant. For instance, in the example question shown in Figure 5 below, a reasonable human might infer (like the LLMs did) that the size of the kiwis was only mentioned because they were discarded.

Figure 5: Example GSM-NoOp question.

The ability to discern relevant information from irrelevant information, especially when the irrelevant information is inserted with the intent to be misleading (i.e. seemingly relevant), depends on guessing the question maker’s intent; in this case, whether they had the intent to deceive. This is a separate skill from pure mathematical reasoning.

The authors strengthen the argument by including a follow-up experiment (NoOp-NoOp) in which the models are implicitly “warned” of the misleading intent: they use few-shot examples that also contain irrelevant information. The subset of models evaluated in this experiment still shows a drop in performance. Several follow-up experiments could help to better understand the phenomenon:

  1. Expand the NoOp-NoOp experiment to more models;
  2. Measure how well models perform when explicitly warned in the prompt that some information may be irrelevant (a sketch of such a prompt follows this list);
  3. Fine-tune models on synthetic training examples that include irrelevant information in addition to examples that contain entirely relevant information.
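As a sketch of the second idea, the warning could be as simple as a prefixed instruction. The wording below is hypothetical and its effect would need to be measured, not assumed.

```python
# Hypothetical instruction; any similar phrasing could be tested.
WARNING = (
    "Note: the problem below may mention details that are not needed to compute "
    "the answer. Identify the relevant quantities first, then solve step by step."
)

def warned_prompt(question: str) -> str:
    """Prepend an explicit heads-up about possibly irrelevant information."""
    return f"{WARNING}\n\nQuestion: {question}\nAnswer:"

print(warned_prompt("Oliver picks 44 kiwis on Friday..."))
```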

Opportunities for Improvement: The Promise of Synthetic Data

While the paper by Mirzadeh et al. highlights important limitations in current LLMs, Gretel's work with synthetic data offers a promising path forward. Gretel has developed key datasets that address many of the challenges identified in the paper:

  • Synthetic GSM8K Dataset: Available on HuggingFace at gretelai/synthetic-gsm8k-reflection-405b, this dataset focuses on generating more complex, multi-step reasoning problems than those in the original human-generated dataset from OpenAI. It incorporates advanced prompting techniques, including Reflection and other cognitive models, to capture detailed reasoning processes. This approach has shown significant improvements, particularly on very hard problems, demonstrating its potential to enhance AI's ability to handle complex, multi-step reasoning tasks. As covered in our blog, Gretel’s synthetic data created with these techniques achieved a 92.3% win-rate on problem complexity and an 82.7% win-rate on educational value over outputs from the standard Llama 3.1 405B parameter model, as judged by GPT-4o. This demonstrates that LLM reasoning can be further unlocked with more sophisticated training examples and prompting techniques than the basic Chain-of-Thought used in the paper.
Source: Teaching AI to Think with Synthetic Data & Reflection
  • Synthetic Text-to-SQL Dataset: Another dataset by Gretel, available at gretelai/synthetic_text_to_sql, has proven highly effective in improving model performance on Text-to-SQL tasks. When used to fine-tune CodeLlama models, it led to 36%+ improvements on the BIRD benchmark, a challenging cross-domain Text-to-SQL evaluation platform. Further supporting the theory that today's LLMs are trained on data that is too simple, which encourages memorization, a single epoch of fine-tuning the Phi-3 and Llama 3.1 models on this dataset yielded a 300%+ improvement on BIRD benchmark problems labeled "very hard". Both datasets can be pulled directly from the Hugging Face Hub, as sketched below.
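Both datasets can be explored with a few lines of code using the Hugging Face `datasets` library; this is only a loading sketch, and split or column names should be inspected before wiring it into a fine-tuning script.

```python
from datasets import load_dataset  # pip install datasets

# Public Gretel datasets referenced above, pulled directly from the Hugging Face Hub.
gsm = load_dataset("gretelai/synthetic-gsm8k-reflection-405b")
sql = load_dataset("gretelai/synthetic_text_to_sql")

print(gsm)  # shows available splits, sizes, and column names
print(sql)
```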

These results show that high-quality synthetic data can be a powerful tool in addressing the limitations of current LLMs in complex reasoning tasks. By generating diverse, challenging problem sets, synthetic data can help push the boundaries of what AI models can tackle, particularly in areas requiring multi-step reasoning and handling of complex, real-world scenarios.

Future Directions

The GSM-Symbolic paper provides valuable insights into the current limitations of LLMs in mathematical reasoning tasks. However, its conclusions should be approached critically. The observed behavior of LLMs could be interpreted in multiple ways, and the paper's emphasis on pattern matching over reasoning may be oversimplifying a complex issue.

That said, the limitations identified by the study are real and significant. The variability in performance, sensitivity to numerical changes, and struggles with irrelevant information all point to areas where current LLMs can be improved.

This is where the promise of synthetic data becomes particularly exciting. By generating diverse, challenging problem sets that push the boundaries of what AI models can tackle, we may be able to develop LLMs that exhibit more robust, human-like reasoning capabilities. The impressive results achieved by fine-tuning models on Gretel's synthetic datasets suggest that this approach has significant potential.

As we move forward, it will be crucial to continue critically examining the capabilities and limitations of LLMs, while also exploring innovative approaches to improve their performance. The combination of rigorous analysis, as exemplified by the GSM-Symbolic study, and creative solutions, such as advanced synthetic data generation, offers a promising path towards more capable and reliable AI systems.