What is Model Soup?
Machine Learning has emerged as an extremely powerful toolkit for many industries and problems. By taking data and learning rules, ML can drastically increase the number of things we can do with computers. Part of the alchemy of training these machine learning systems is called “hyperparameter tuning”. There are a number of these hyperparameters that an engineer or scientist may choose to work with, each with different downstream effects. They often work together somewhat mysteriously to increase or decrease performance.
For many years, ML practitioners would train multiple models each with a different hyperparameter configuration and then choose the best performing of all those models. This “final” model supposedly represented the best performance that could be reached given the data and model choices.
However, it was often the case that even better performance could be obtained by using multiple models in parallel and averaging the output of each model. This “ensembling” technique led to improved performance at the cost of running many models at once instead of a single model.
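To make the distinction concrete, here is a minimal sketch of a naive output-averaging ensemble, using toy stand-in models (the function names and models are illustrative, not Gretel's API):

```python
import numpy as np

def ensemble_predict(models, x):
    """Naive ensemble: run every model on the input and average their outputs.
    Inference cost grows linearly with the number of models kept around."""
    outputs = [model(x) for model in models]
    return np.mean(outputs, axis=0)

# Two toy "models" that return class probabilities for any input.
model_a = lambda x: np.array([0.8, 0.2])
model_b = lambda x: np.array([0.6, 0.4])

probs = ensemble_predict([model_a, model_b], x=None)
print(probs)  # [0.7 0.3]
```

Note that every model must be run at inference time, which is exactly the cost that model soups avoid.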
Model soup is a recent discovery, introduced in the paper “Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time” (Wortsman et al., 2022), in which an ensemble of models is formed by averaging the weights of the models instead of combining their individual outputs. The result is a single model, which is the average of many models with many different hyperparameter configurations.
A key finding from this paper is that uniformly or greedily averaging models outperforms any individual model on the ImageNet benchmark dataset. Additionally, they find that the soup can outperform a naive ensemble while keeping the inference cost of a single model.
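As a sketch of the two recipes, the snippet below averages models represented as dictionaries of NumPy weight arrays, a stand-in for real framework state dicts; `score_fn` is a hypothetical held-out evaluation function, not part of any library:

```python
import numpy as np

def uniform_soup(state_dicts):
    """Uniform soup: average each weight tensor across all models."""
    return {key: np.mean([sd[key] for sd in state_dicts], axis=0)
            for key in state_dicts[0]}

def greedy_soup(state_dicts, score_fn):
    """Greedy soup: visit models best-first, keeping each one only if
    adding it to the running average does not hurt the held-out score."""
    ranked = sorted(state_dicts, key=score_fn, reverse=True)
    kept = [ranked[0]]
    for sd in ranked[1:]:
        if score_fn(uniform_soup(kept + [sd])) >= score_fn(uniform_soup(kept)):
            kept.append(sd)
    return uniform_soup(kept)
```

The greedy variant needs one held-out evaluation per candidate model, but it can skip models that would drag the average down.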
In the image above, the green diamond models are all trained with different hyperparameters, while the blue circle and purple star show the results of averaging those models’ weights together.
At Gretel, we have users with very small datasets (< 100 rows) trying to increase the size of their data by generating more data via our synthetic data tools. Often when there are only a few examples, our models don’t perform up to standard. Therefore, we briefly explored model soups as a way to improve model performance on smaller datasets.
Experiments and Implementation
To determine if this is a direction we want to continue investing in, we ran a minimal viable experiment:
- Train a menagerie of models on chosen subsets of a dataset
- Choose the top two performing models
- Soup the models
- Measure the Synthetic Data Quality Score (SQS), which ranges from 0 to 100 (higher is better)
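The steps above can be sketched end to end. The training loop and SQS evaluation are replaced by toy stand-ins here (the weight dictionaries and score values are illustrative, not real results or Gretel's API):

```python
import numpy as np

def soup_top_k(state_dicts, scores, k=2):
    """Average the weights of the k best-scoring models in the menagerie."""
    top = np.argsort(scores)[-k:]
    chosen = [state_dicts[i] for i in top]
    return {key: np.mean([sd[key] for sd in chosen], axis=0)
            for key in chosen[0]}

# Toy stand-ins: a menagerie of three "trained models" and their scores
# (e.g. per-character accuracy); all numbers are illustrative.
menagerie = [{"w": np.array([0.0])},
             {"w": np.array([2.0])},
             {"w": np.array([4.0])}]
scores = [10, 40, 44]

soup = soup_top_k(menagerie, scores, k=2)
print(soup["w"])  # [3.]
```

The resulting soup is a single model, so the final SQS evaluation runs at the cost of one model, not two.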
In the first experiment, the two models were chosen by performance as measured by per-character accuracy:
- SQS score, model 1: 44
- SQS score, model 2: 42
- SQS score, soup model: 61
In the second experiment, the two models were instead chosen by performance as measured by loss, but still evaluated on SQS:
- SQS score, model 1: 40
- SQS score, model 2: 36
- SQS score, soup model: 57
As an ablation, we tried souping two untrained models, and also souping two copies of the same trained model; the SQS score did not improve in either case.
Interestingly, however, we found that performance degraded when averaging more than 5 models, or when souping certain types of models (those with low accuracy to begin with).
We should explore whether there is a pattern behind this degradation to determine when soups are effective and when they are not. It would also be interesting to determine whether there is a threshold number of models beyond which performance begins to improve again.
We replicated a small set of the results around model soups, showing that performance improved as measured by synthetic data quality which may lead to improved performance when customers have small amounts of data. We identified a few areas for more exploration! If this is exciting to you, feel free to reach out at firstname.lastname@example.org or join and share your ideas in our Slack community.