Introducing Model Suites for Synthetic Data Generation

A new standard for ensuring regulatory compliance and managing the complexities of compound AI systems.

Data Designer, Gretel’s latest innovation, takes the complexity out of synthetic data generation. Central to that simplification are Model Suites, a new design concept that streamlines the creation of high-quality synthetic data while addressing critical challenges like licensing and compliance.

Why Model Suites?
Generating synthetic data often involves navigating a maze of model options and licensing restrictions, creating hurdles for teams aiming to focus on data quality and use case optimization. Model Suites provide a streamlined solution, offering curated collections of tools and models designed to:

  • Simplify Model Selection: Work with intuitive, high-level abstractions instead of getting bogged down in model details.
  • Ensure License Compliance: Meet regulatory and legal requirements effortlessly.
  • Guarantee Quality: Leverage collections vetted for producing high-quality synthetic data across domains.
  • Offer Flexibility: Access state-of-the-art models as soon as they’re available.
  • Provide Transparency: See exactly which models are used in your data generation logs.

By integrating Model Suites into Data Designer, we’ve created a workflow that not only accelerates synthetic data generation but also ensures users can confidently navigate complex compliance and quality requirements.

Before diving deeper into how Model Suites work, let’s explore why the future of synthetic data—and generative AI as a whole—is compound. 👇🏼

The Future of GenAI is Compound

At Gretel, we made an early bet that the future of generative AI applications and services—especially for synthetic data—would be compound. Instead of relying on a single large language model (LLM), generative AI implementations would need to integrate multiple tools and models working together in a coordinated way.

Fast forward to today, we see this trend accelerating with the rise of AI agents and compound AI systems more generally—intelligent systems that chain together specialized tools and language models to solve complex problems. Generative AI assistants now frequently integrate tools like web search, calculators, and code interpreters, and often make multiple LLM calls to refine their outputs.

However, while progress has been made in chaining tools and models, many implementations still rely heavily on individual large models, partly due to confusion around licensing and the challenge of managing it across multiple components. Even the largest models—340B or 405B parameters—continue to face significant limitations:

  • Limited context understanding: Struggles with generating coherent outputs for long or nuanced queries.
  • Inconsistent outputs: Unpredictable variations in responses, leading to reliability issues.
  • Domain-specific limitations: General-purpose models fall short in specialized areas like healthcare, finance, or coding.

As agentic AI takes center stage in 2025, it’s becoming clear that the era of massive, kitchen-sink language models lacks the specificity required for scaling critical applications. The industry is moving beyond building hammers and instead assembling Swiss Army knives—compound AI systems that bring together the right tools for the right tasks.

Defining Compound AI Systems

Before diving further, let’s define what a Compound AI System is, straight from BAIR (Berkeley Artificial Intelligence Research):

[A Compound AI System is] a system that tackles AI tasks using multiple interacting components, including multiple calls to models, retrievers, or external tools. In contrast, an AI Model is simply a statistical model, e.g., a Transformer that predicts the next token in text.

The diagram below from BAIR illustrates this evolution. On the left, traditional monolithic AI models are large, general-purpose systems designed to handle a wide range of tasks on their own. On the right, compound AI systems integrate multiple models, processing steps, non-AI tools, and services into a holistic workflow capable of tackling complex problems more efficiently.

Figure 1. The evolution from monolithic models to compound systems. Source: BAIR

Unlike monolithic models, compound systems provide greater flexibility and adaptability by leveraging the strengths of specialized, task-specific components working together. This approach not only solves complex tasks more effectively but often does so faster than relying on a single general-purpose model.

(Side note: while this diagram effectively captures the concept, it could benefit from showcasing architectures that emphasize small language models, which can improve accuracy on specialized tasks.)

Gretel Navigator

High-quality synthetic data is notoriously difficult to generate. That’s why we built and launched Gretel Navigator, the first compound AI system purpose-built for generating and iterating on synthetic data. Whether you’re designing data from scratch, scaling up from a few examples, augmenting an existing dataset, or refining previously generated data, Navigator provides state-of-the-art (SOTA) tools to help teams eliminate bottlenecks.

To enhance this capability, we’ve launched Data Designer, an extension of Gretel Navigator that lets you quickly design and iterate on data via an intuitive workflow and built-in data evaluations. Data Designer helps users easily define their needs, preview and iterate on datasets, and scale data generation with confidence.

At the core of Navigator are specialized tools and models that work together to generate high-quality data at any scale. For example, to create a text-to-code dataset, one set of tools might seed the data generation process, ensuring a representative and diverse output. A specialized language model can then generate code, while another model handles pure text or numerical data. Additional tools refine the data by providing AI feedback, correcting mistakes, and enhancing complexity, diversity, and style. Other tools validate code to ensure it is parseable, executable, and accurate, while another layer evaluates quality and correctness.
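To make that flow concrete, here is a minimal sketch of what such a text-to-code pipeline could look like. The record structure, stand-in "models," and validator below are hypothetical illustrations for this post, not Navigator’s actual internals:

import ast
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of a compound text-to-code pipeline; real systems
# route each step to a purpose-built tool or specialized model.

@dataclass
class Record:
    topic: str
    text: str = ""
    code: str = ""
    valid: bool = False

def validate_python(code: str) -> bool:
    """Check that generated code is at least parseable -- one of several
    checks a real validation layer would run (executability, accuracy)."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def run_pipeline(
    topics: list[str],
    text_model: Callable[[str], str],  # specialized model for prose
    code_model: Callable[[str], str],  # specialized model for code
) -> list[Record]:
    records = []
    for topic in topics:
        rec = Record(topic=topic)
        rec.text = text_model(f"Write a coding task about {topic}.")
        rec.code = code_model(rec.text)
        rec.valid = validate_python(rec.code)
        records.append(rec)
    # Keep only records that pass validation; a real system would also
    # route failures back through a correction step.
    return [r for r in records if r.valid]

# Toy stand-in "models" so the sketch runs end to end.
dataset = run_pipeline(
    topics=["healthcare"],
    text_model=lambda p: "Compute a patient's BMI from height and weight.",
    code_model=lambda t: "def bmi(weight_kg, height_m):\n    return weight_kg / height_m ** 2",
)
print(dataset)

In Navigator, each of these steps is handled by a dedicated tool or model rather than a toy lambda, but the compound shape (generate, refine, validate) is the same.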

Choosing the right models for these tasks can be complex, especially with the constant release of new state-of-the-art models. Navigator simplifies this process by letting users focus on the desired license for their data while Gretel handles model selection to fit their specific use case.

The result? Data of significantly higher quality than any single LLM can produce. In fact, it doesn’t matter what the latest SOTA LLM is. With our compound approach, Gretel Navigator extracts the best possible performance from the latest SOTA models as they hit the market. That’s why we stay at the forefront of model developments, incorporating cutting-edge releases like Nova, Mistral, Llama-3.x, and Gemini into our platform. This ensures our users immediately benefit from the progress—going beyond what’s achievable with a foundation model alone.

The Need for Model Suites

While our approach to synthetic data generation is compound, we aim to simplify the process. Users shouldn’t have to decide which tools to apply for specific tasks or manage the complexities of combining models with conflicting licenses. Most users of compound AI systems aren’t system builders and won’t know which combinations of tools and models are optimal for their needs.

Consider the following example adapted from BAIR. If an MIT-licensed model is combined with an Apache-2.0 model and a Llama-3.2-licensed model from Meta, what happens to the licensing of the overall compound system? Are there restrictions on how the resulting system—or its synthetic data outputs—can be used? When the synthetic data is intended for fine-tuning or training another model, ownership and usage rights become non-trivial concerns with significant legal and regulatory implications.

Figure 2. An example of a compound AI system. Source: BAIR

These challenges aren’t hypothetical. In their technical report on Granite 3.0 language models, IBM highlights:

“…in an enterprise setting, the adoption of LLMs can have further constraints. The provenance and transparency around data usage and processing can have legal and compliance implications. In particular, the license that an LLM is released under can also restrict companies from using a model on their specific use cases.”

In the example above, Meta’s more restrictive Llama license ultimately determines the system’s licensing terms. This means both the system and its outputs must comply with Meta’s proprietary license—which is not recognized by the Open Source Initiative (OSI), the organization that stewards the open-source definition. The Llama license includes commercial use restrictions, a requirement to prominently display the Llama name, and a mandate to include that name in any derivative models, among other terms. In addition, Meta’s Acceptable Use Policy (AUP) prohibits certain activities, such as failing to disclose known risks to end users or using the model for critical infrastructure or heavy machinery applications.

This is a relatively simple example. In practice, licensing challenges multiply when combining more tools or models, particularly with the growing prevalence of Small Language Models, the misuse of the term "open-source," and occasional ignorance—willful or otherwise—of licensing terms. For most users, this level of complexity is overwhelming.
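To see why this compounds so quickly, consider a toy model of the dynamic described above, where the most restrictive component license dominates the composed system. The restrictiveness ordering below is a deliberate simplification for illustration, not legal guidance:

# Toy model of license propagation in a compound system: the effective
# license is the most restrictive one among the components. The ordering
# here is a simplifying assumption for this sketch, not legal advice.
RESTRICTIVENESS = {
    "MIT": 0,
    "Apache-2.0": 1,
    "Llama-3.2": 2,  # proprietary terms; not OSI-recognized
}

def effective_license(component_licenses: list[str]) -> str:
    """Return the license whose terms dominate the composed system."""
    return max(component_licenses, key=lambda lic: RESTRICTIVENESS[lic])

# The example from the text: MIT + Apache-2.0 + Llama-3.2 components.
print(effective_license(["MIT", "Apache-2.0", "Llama-3.2"]))
# -> "Llama-3.2": the whole system inherits Meta's restrictions.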

How do you ensure data and model provenance? How do you guarantee legal and regulatory compliance? These are the questions that Model Suites are designed to answer.

Available Model Suites

To address the challenges of model selection and licensing compliance, Gretel Navigator and Data Designer offer three curated Model Suites, providing enterprises with the flexibility to align tools with their specific needs:

  1. Apache-2.0 Suite:
    • Provides full data ownership with no restrictions on commercial use or dataset licensing.
    • Includes models released under OSI-recognized licenses, such as Qwen-2.5 (Apache-2.0), Phi-3.5 (MIT), and Mistral NeMo (Apache-2.0), alongside other open-source tooling.
  2. Llama-3.x Suite:
    • Leverages Llama-3.1 and Llama-3.2 models from Meta.
    • Can incorporate permissive open-source models and tools, such as Apache-2.0 and MIT-licensed options.
  3. Amazon Nova Suite:
    • Leverages AWS Nova Micro, Lite, and Pro models.
    • Like the Llama-3.x suite, can incorporate permissive open-source models and tools, such as Apache-2.0 and MIT-licensed options.

These suites ensure users can navigate licensing complexities while maintaining flexibility, compliance, and high-quality synthetic data generation.

Examples

Picking your Model Suite is simple. If you’re using the Data Designer SDK, you can initialize it with a single line of code. Here’s how to quickly configure your Model Suite using Gretel’s tools.

from gretel_client.navigator import DataDesignerFactory  # import path may vary by SDK version

# Initialize Data Designer from a blank canvas with the Apache-2.0 suite
data_designer = DataDesignerFactory.from_blank_canvas(
    model_suite="apache-2.0"
)


Alternatively, you can specify the model suite through the Data Designer YAML configuration: 

model_suite: llama-3.x

special_system_instructions: >
  You are an expert at writing, analyzing, and editing Python code.
  Your job is to assist the user with their Python-related tasks.
  
categorical_seed_columns:
  - name: industry_sector
    values:
      - Healthcare

generated_data_columns:
  - name: text
    generation_prompt: >
      Write a prompt for a text-to-code dataset that
      is related to {industry_sector}.
  - name: code
    generation_prompt: >
      Write Python code that will be paired with the
      following prompt: {text}

post_processors:
  - validator: code
    settings:
      code_lang: python
      code_columns: [code]


These workflows make selecting and leveraging Model Suites as straightforward as defining your project’s requirements. 
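For file-based workflows, loading that configuration from the SDK might look like the sketch below. The from_config entry point and the file name are assumptions here, so check the Gretel SDK docs for the exact call:

from gretel_client.navigator import DataDesignerFactory  # import path may vary by SDK version

# Assumption: the factory exposes a from_config entry point that accepts a
# path to the YAML shown above (hypothetical here; consult the SDK docs).
data_designer = DataDesignerFactory.from_config("data_designer_config.yaml")

# The model_suite key in the YAML ("llama-3.x") now governs which models
# Navigator selects for every generation and validation step.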

Conclusion

Model Suites represent a significant step forward in making compound AI systems accessible and practical for enterprise use. By abstracting away the complexities of model selection and license compliance, they empower organizations to focus on their core objectives while maintaining legal and regulatory compliance.

The introduction of the Apache-2.0, Llama-3.x, and Amazon Nova suites reflects Gretel's commitment to providing flexible, transparent solutions that address real-world challenges in synthetic data generation.

For enterprises looking to harness synthetic data while retaining full control and ensuring compliance, Model Suites offer a clear path forward. This approach is poised to become the industry standard for managing the complexities of compound AI systems, particularly in enterprise environments where transparency, compliance, and data quality are critical.

If you have questions or comments about Model Suites or the Gretel Platform, we’d love to hear your thoughts. Join us and more than 1,600 other developers, engineers, data scientists, and privacy fans in the Synthetic Data Community on Discord.