The explosion of small language models (SLMs) and license confusion

Rapid SLM releases highlight the need for clarity on licenses + lineage, which are crucial for enterprises navigating open-weight models and synthetic data ownership

TL;DR:

  1. Open-weight models aren’t the same thing as open-source models
  2. Most open-weight models come with significant restrictions on model and output use
  3. Model licenses have implications for the synthetic data they produce
  4. High-quality data shouldn’t come at the expense of data ownership considerations
  5. Generating best-in-class synthetic data is a much harder problem than making a single call to an LLM; it requires a compound AI approach

The state of small language models

Performant small language models (SLMs) entered the GenAI stage in 2023. We are only halfway through 2024, but SLMs and their use cases have absolutely exploded. If new pretrained models were released roughly every month in 2023, we are now seeing releases almost on a weekly, if not daily, basis. The last two weeks alone saw SmolLM, Codestral Mamba, Mathstral, Danube 3, Mistral NeMo, and Llama-3.1 announcements. And these are just the foundation models, not models further fine-tuned for specific business use cases! The latter is where the real explosion is happening, due to the relative ease with which SLMs can be modified on a very small budget (e.g., see the MosaicML blog post on training from scratch and the LoRA fine-tuning blog post from Predibase).
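To make the "small budget" point concrete, here is a minimal sketch of LoRA fine-tuning with the Hugging Face `transformers` and `peft` libraries. The checkpoint name and hyperparameters are illustrative assumptions, not recommendations:

```python
# A minimal LoRA fine-tuning sketch; checkpoint and hyperparameters are
# illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # any open-weight SLM checkpoint
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA trains small low-rank adapter matrices instead of the full weights,
# which is what makes customization feasible on a small budget.
lora = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

From here, the adapted model can be trained with a standard `transformers` `Trainer` loop, often on a single GPU depending on model size and quantization.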

Figure 1: Explosion in SLMs (Small Language Models) over the past two years 

An unintended consequence of the explosion in small language models is growing confusion around model licensing. In the past, only major LLM providers had the resources to bring a language model to market (e.g., Meta with the Llama series, Microsoft with the Phi series, Google with the Gemma series). Today, small companies and even individuals are able to fine-tune, and often pre-train, SLMs on their own. However, individuals and smaller companies are often unable to perform due diligence with respect to models and licenses. Mix in frequent misunderstandings of how licenses work and what they cover, misuse of the term open-source, and occasional willful ignorance of licenses, and you end up with a giant tangled licensing ball and a whole lot of confusion. [Side note: what one really needs here is proper lineage with respect to datasets and models, but that’s an entirely separate blog post in itself.]

Closed-source, open-weight & open-source

Let’s start with the basics, which lately get confused left and right: open-weight vs. open-source vs. closed-source models.

  1. Closed-source models are probably the easiest to understand. Their development is usually controlled by a single entity or company. They are typically available solely via an API or deep integrations with cloud providers (e.g., Amazon Bedrock). They cannot be modified or redistributed without permission. In most cases, you have to pay to use them, and they are NOT available on Hugging Face to be downloaded and inspected. Models from OpenAI, Cohere, and Anthropic are examples of closed-source models. Closed-source models often have licenses with provisions such as this one from OpenAI:

    “... you may not … Use [model] Output to develop models that compete with OpenAI”

    What’s deemed to be competitive is open to interpretation by OpenAI and could include any model that takes business away from OpenAI. Put differently: you can’t legally just take data generated by OpenAI and use it to customize your own model so you can cut down on costs.

  2. Open-weight models are models for which the weights (aka parameters) of the neural network are publicly available. Every model publicly released on Hugging Face is an open-weight model (models kept in companies’ private collections aren’t open-weight). However, not every open-weight model is open-source. Open-weight models often come with specific licenses that may restrict modification, distribution, or commercial use, and/or require attribution. Llama-2, Gemma-2, and Llama-3 are examples of open-weight models. They typically carry a custom license that restricts the use of the model. For example, provision 1.b.v of the Llama 3 license states that

    “You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Meta Llama 3 or derivative works thereof).”

    Provision 1.b.i states that if you are building a product with Llama-3 or distributing the model or any derivatives of it,

    “you shall (A) provide a copy of [the Llama license] and (B) prominently display “Built with Meta Llama 3” on a related website, user interface, blogpost, about page, or product documentation… you shall also include “Llama 3” at the beginning of any such AI model name”

    Lastly, if using the model for commercial purposes and if

    "… the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta”

    All of the above is vastly different from a true open-source model, yet many people casually refer to Llama models, and many other open-weight models with proprietary licenses, as open-source. Imagine Python (a truly open-source programming language) having similar provisions; we would end up with quite an exotic zoo of Python Llama, Python Mistral, Python Falcon, and so on. While it's common for open-source licenses to require attribution, mandating specific branding and naming conventions goes beyond standard practice. It ensures that a model gets visibility and recognition, but it runs contrary to the spirit of open source, which values flexibility and community-driven development.

    Parsing the Llama-3.1 License Update 🦙

    The just-released Llama-3.1 model is one of the top-performing LLMs in the space, achieving remarkable results and substantially shrinking the gap between SOTA closed-source models and open-weight models. It also comes with welcome and important changes in the Llama-3.1 license that remove provision 1.b.v and allow the output of the model to be used to train other language models. However, similar provisions regarding commercial use, providing the license to users, and including the name in the model name are still in place. This is still not a license recognized by the Open Source Initiative.

    Removal of provision 1.b.v is one of the main reasons Meta talks about Llama-3.1 unlocking synthetic data generation. However, many other SOTA LLMs already allow for synthetic data generation. The reality is that generating high-quality synthetic data is a much harder problem than just making a call to an LLM. That’s precisely why we’ve been focusing on a compound AI approach to synthetics. It doesn’t matter what the latest model innovation is: a compound AI system built on top of an LLM can always generate much higher-quality data by leveraging other models and tooling alongside that LLM. We are excited to be able to use Llama-3.1 at scale as part of our synthetic data platform and to provide guidance to our customers on the model’s restrictions. In fact, we have already rolled out Llama-3.1 as one of the models available in Gretel Navigator!
  3. Open-source models are a special subset of open-weight models. Not only are the weights publicly available, but the model is freely available for anyone to use, modify, and distribute, including for commercial purposes. Open-source models typically come with a permissive license like MIT or Apache 2.0. The key idea is that the model is free to use and build upon, in many cases without any strings attached (strong copyleft licenses like GPL and AGPL are trickier to navigate). A few examples of open-source models are the Mistral models, such as Mistral-7B, Mistral NeMo, and Mixtral-8x7B, as well as the Phi series of models. Mistral models were released under the Apache 2.0 license and Phi models under the MIT license.

    The ability to build with truly open-source models and to freely use the output they generate really sets them apart from models that are merely open-weight. For example, one cannot take the output of Gemma or Llama models, use it to fine-tune another base model that happens to be open-source, and publish that new model under an open-source license. Doing so would violate multiple license provisions. Unfortunately, we see many real-world examples of exactly this.

To understand the model licensing landscape and shed light on what we see in the world of open-weight models today, we used July 2024 data from Hugging Face. We visualized all of the model licenses currently in use, along with the number of models under each license type, and used a Sankey diagram to group the various licenses into open-source and non-open-source categories; a sketch of the approach follows.
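For transparency, here is a minimal sketch of how license counts like these can be pulled from the Hub using the `huggingface_hub` library. The Hub exposes licenses as `license:*` tags on each repo; the open-source grouping below is a small illustrative subset, not the full mapping behind our figures:

```python
# Sketch: count model licenses on the Hugging Face Hub via license:* tags.
from collections import Counter
from huggingface_hub import HfApi

# Illustrative subset; the full open-source grouping is much larger.
OPEN_SOURCE = {"apache-2.0", "mit", "bsd-3-clause", "cc-by-4.0"}

api = HfApi()
counts = Counter()
for model in api.list_models():  # iterates over every public model; slow at full scale
    licenses = [t.split(":", 1)[1] for t in (model.tags or []) if t.startswith("license:")]
    counts[licenses[0] if licenses else "no-license"] += 1

total = sum(counts.values())
licensed = total - counts["no-license"]
open_source = sum(n for lic, n in counts.items() if lic in OPEN_SOURCE)
print(f"{total:,} models, {licensed:,} with license metadata ({licensed / total:.0%})")
print(f"{open_source / total:.0%} carry a recognized open-source license (per this subset)")
```

Grouped counts like these feed directly into a Sankey diagram (e.g., with Plotly's `graph_objects.Sankey`).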

The model licensing landscape

Figure 2: Hugging Face Model License Categories for July 2024

As of July 2024, there are 784,000 open-weight models on Hugging Face. However, license information is available for only 289,000 of them, or 37% of the total. It is fair to assume that the 63% missing license information are purely open-weight, i.e., not open-source (we are taking a practitioner's approach here: anyone contributing to open-source will make sure to include a relevant open-source license; if the license is missing, the person or company either wasn’t sure what the license should be or forgot to include it).

For open-weight models with license information, 34% were NOT open-source. However, we strongly suspect that this number severely underrepresents reality. For example, the Llama-2 license accounts for only 2% of all models with a license, which is quite low considering Llama’s adoption. Examining the top Llama models by number of downloads, quite a few carry a non-Llama license, like Apache 2.0 or CC-BY-NC-4.0, despite being derivatives of Llama. Combining these with the models that don’t have a license assigned, we are looking at a total of 75% of models that aren’t open-source. Only 25% are open-source, and even that is an overestimate due to widespread license confusion.

Among models with license information, Apache 2.0 takes the lead at 42%, followed by the MIT license at 18%. Once models without any license attached are included in the denominator, that translates to only 16% of all models carrying the Apache 2.0 license and 7% the MIT license. Again, these numbers are overestimates, highlighting a huge opportunity for companies to contribute back to open-source.

The data licensing landscape

In addition to model licenses, and for completeness’ sake, we also looked at the licenses of all 188,000 datasets published on Hugging Face. We wanted to do so because datasets are increasingly being discovered on Hugging Face and used to train and fine-tune language models. This has implications for model licensing, especially as more and more of these datasets are synthetic in nature, i.e., generated by LLMs.
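The dataset counts can be gathered the same way as the model counts above; again a sketch, not the exact pipeline behind the figures:

```python
# Sketch: count dataset licenses on the Hub, mirroring the model analysis.
from collections import Counter
from huggingface_hub import HfApi

counts = Counter()
for ds in HfApi().list_datasets():  # iterates over every public dataset repo
    licenses = [t.split(":", 1)[1] for t in (ds.tags or []) if t.startswith("license:")]
    counts[licenses[0] if licenses else "no-license"] += 1

print(counts.most_common(10))  # top dataset licenses by repo count
```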

Figure 3: Hugging Face Data License Categories for July 2024

While the percentages look similar, there are important differences. First, the number of datasets is more than four times smaller than the number of models, with only 188,000 datasets published. We expect this to change in the coming years as more and more synthetic data is published on Hugging Face. Second, the portion of open-source datasets is bigger, at 74%. At the same time, data generated with popular purely open-weight models like Llama totals less than 1%, which likely indicates that Llama-generated datasets are being assigned an incorrect license. In other words, 74% is an overestimate of reality, and that has real legal implications for people looking to leverage data to customize their models, and for the type of license they can assign to their work.

Protecting data ownership

At Gretel, we pay a lot of attention to licensing. Since we are a synthetic data platform, our customers expect to own the data they generate and to use it for their downstream tasks. They place a lot of trust in us to do the hard work of making sure licensing is respected – something we don’t take lightly. This is especially important for companies in heavily regulated industries, but also for enterprises more generally. We also have to pay close attention to licensing as a business because we don’t use just one LLM or tool: our compound AI system leverages many different tools and LLMs, including custom LLMs, to generate the highest-quality data possible. High-quality data shouldn’t come at the expense of data ownership considerations. The data generated with Gretel is yours to use for your downstream applications, and we provide guidelines with respect to any restrictions.

Go ahead, give Gretel Navigator a try! And if you find the world of SLMs, compound AI systems, licensing/ownership, and synthetic data interesting, reach out to us! We're hiring.