Sample-to-Dataset: Generate Rich Datasets from Limited Samples Using Data Designer

Seed to succeed: use the sample-to-dataset workflow to create diverse, large-scale synthetic datasets tailored to your needs with nothing but a few samples.

Key Benefits of Navigator Data Designer - Sample-to-Dataset

  • Generate 1000s of records from just a few samples
  • Automatically expand and explore the problem domain as a human expert would
  • Agent-based pattern detection finds meaningful seed categories in your data
  • Maintains your data's exact format and context (e.g. style, template, code)
  • Automatic data cleanup and standardization out of the box
  • Fast experimentation through preview & iterate workflow 

Let's dive in! 👇

The Case for Sample Data

The age-old wisdom "a picture is worth a thousand words" succinctly describes how a single image can convey what paragraphs of text often struggle to describe. It turns out that the same principle applies to data: a sample of data reveals structure, formatting quirks, implicit business rules and intricacies that even pages of documentation may struggle to convey.

“What does this have to do with synthetic data?” you may ask. Well, one of the trends we identified is that almost everyone looking to fine-tune an LLM is more than willing to spend a little time putting together just a few sample records describing the data they want. Think of a technical product manager building a Retrieval Augmented Generation (RAG) application for a healthcare company, a machine learning engineer looking to improve the performance of a Named Entity Recognition (NER) model, or a data scientist working on a new AI feature for a real-estate company. Sure, they can describe things in words. However, that might take more time than just coming up with a few records from scratch! Or maybe they already have data that’s quite messy and they were able to curate a handful of “golden” records.

The challenge, however, is that much more than a handful of records is needed to customize even a Small Language Model (SLM), never mind a model with tens or hundreds of billions of parameters. How can one go from a few samples to tens or hundreds of thousands of diverse, rich and high-quality records? Enter Gretel’s compound AI system, Navigator, for creating synthetic data and the new sample-to-dataset capability in Data Designer!

The Sample-To-Dataset Workflow

The focus of the new sample-to-dataset workflow is quite different from synthetic generation that aims to capture the statistical properties of an existing dataset; one can’t aptly capture statistical properties from just a few sample records. Instead, the new workflow addresses a very common use case: generating a diverse, rich dataset the way a human expert would in order to meaningfully explore and expand a problem domain. Think generating diverse data for AI training and fine-tuning, addressing class imbalances, or increasing the complexity and richness of data, all much faster and cheaper than employing an army of human labelers.

The sample-to-dataset workflow builds on Data Designer’s modular foundation and its thesis that solid synthetic data tooling must eliminate grunt work and come with best data generation and evaluation practices built in. The workflow is already available in our Python SDK, and getting started with sample-to-dataset is quite easy (feel free to follow along in this notebook)!

First, we are going to import Data Designer and specify the model suite (e.g., “apache-2.0”, “llama-3.x”). Selection of a model suite depends on your use case as well as legal, compliance and data ownership requirements. It affects which set of tools and models is used by the compound AI system behind Gretel Navigator:

from gretel_client.navigator import DataDesignerFactory

MODEL_SUITE = "apache-2.0"

Next, we are going to curate just a few samples of data.

For the purposes of this blog post, we are going to use a small sample from the Kaggle dataset Real Estate London 2024 (however, see other examples in the notebook as well!). This dataset “provides a snapshot of properties listed for sale in London … It includes various property details such as the number of bedrooms, bathrooms, type of property, and price.” Imagine a developer working on a RAG system, and having just a few high-quality samples for a specific real-estate market. They are looking to generate substantially more records to either develop or evaluate their system and/or fine-tune their task-specific model.

Figure 1. A sample from the Real Estate London 2024 dataset.
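
If your curated samples live in a CSV file, as the Kaggle dataset does, one lightweight way to turn a handful of rows into the sample_records used below is sketched here. The file name is a placeholder and the exact columns are assumptions; any small list of dictionaries with consistent keys will do.

import pandas as pd

# Hypothetical file path -- substitute your own curated sample.
df = pd.read_csv("real_estate_london_2024_sample.csv")

# A handful of high-quality rows is enough; we take the first five here.
sample_records = df.head(5).to_dict(orient="records")

print(f"Curated {len(sample_records)} sample records")
print(sample_records[0])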

We can now instantiate Data Designer from sample records:

data_designer = DataDesignerFactory.from_sample_records(
   sample_records=sample_records,
   model_suite=MODEL_SUITE,
   api_key="prompt"
)

Step 1: Seeding the Data

Core to being able to generate any diverse and large-scale dataset is properly seeding the synthetic data generation process. Think of seeding as planting a garden - success lies not just in the quantity of seeds, but in knowing which ones to plant and where to plant them. Similarly to a landscape designer envisioning how carefully-chosen seeds or saplings will grow into a lush garden, we must select data seeds that capture the essential patterns and relationships we want to see flourish in our final dataset. Data seeds give us important control levers, allowing us to shape the diversity, complexity, and scale of the synthetic data.

Data Designer already allows one to seed data through a semi-automatic process: the user manually specifies seed categories, and the seed values are produced automatically. The sample-to-dataset workflow takes this to a new level by adding the capability to automatically identify, evaluate and suggest data seed categories based on a sample of data. This automation removes a tremendous amount of grunt work, all while tailoring synthetic data to a company-specific and application-specific use case. Extracting data seeds based on a small sample of data is the very first step of the sample-to-dataset workflow:

data_seeds = data_designer.run_data_seeds_step(
   max_num_seeds=7,
   num_assistants=5,
   verbose_logging=True,
)

Let’s see what happens here when we use just 5 samples of the real-estate data:

💭 Step 1: Extract data seeds from sample
  |-- 🧹 Pre-processing the sample dataset
  |    |-- 📏 Standardizing dataset column names:
  |    |    |-- 1) All characters must be lowercase.
  |    |    |-- 2) Names should use snake_case format.
  |    |    |-- 3) Only alphanumeric characters and underscores are allowed.
  |    |    |-- 4) Column names should not start with a number.
...

Out of the gate, best data practices are applied by default. The sample dataset is pre-processed and its schema is standardized so that all column names are lowercase, use snake_case, contain only alphanumeric characters and underscores, and do not start with a number. The final schema looks much better than the original.
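
As a toy illustration of those four naming rules, a simple normalizer could look like the sketch below. This is not Gretel’s implementation, just a plain-Python rendering of the same convention.

import re

def standardize_column_name(name: str) -> str:
    # Rule 1: all characters lowercase.
    name = name.strip().lower()
    # Rules 2 & 3: snake_case with only alphanumerics and underscores.
    name = re.sub(r"[^\w]+", "_", name)
    name = re.sub(r"_+", "_", name).strip("_")
    # Rule 4: column names should not start with a number.
    if name and name[0].isdigit():
        name = f"col_{name}"
    return name

print(standardize_column_name("Price (GBP)"))    # -> price_gbp
print(standardize_column_name("2024 Listings"))  # -> col_2024_listings

With the schema standardized, we are ready to start extracting data seeds: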

  |-- 🧠 Extracting data seed categories from the preprocessed sample dataset
  |-- 👥 Polling results from 5 assistant LLMs
  |    |-- 🤔 assistant 1 is thinking
  |    |-- 🤔 assistant 2 is thinking
  |    |-- 🤔 assistant 3 is thinking
  |    |-- 🤔 assistant 4 is thinking
  |    |-- 🤔 assistant 5 is thinking
  |    |-- 🤔 assistant 1 is thinking
  |    |-- 🤔 assistant 4 is thinking
  |-- 🏅 Ranking and pruning the extracted data seed categories
  |    |-- 🏆 Final ranked selection has 17 data seed categories
  |-- 🌱 Generating values for each of the 13 ranked data seed categories
  |-- 📐 Checking that generated data seeds are conceptually orthogonal
  |    |-- 🍃 dropping 5 non-orthogonal data seed categories

...

First, we focus on extracting data seed categories. Just as a landscape may be seeded with different categories of plants - anchor plants, accent plants, climbers, etc. - our data seed categories capture different properties of the dataset and subsequently establish core patterns, add texture, and introduce variations. Here and in everything that follows, we are also leveraging much more involved inference methods than a simple call to an LLM in order to elicit an inner monologue and better reason through complex problems. In this example, we are using Cognition, which is an inference technique developed by Gretel that goes beyond Reflection-style approaches and introduces important concepts replicating additional cognitive processes (more on this in a separate blog post).

Second, we leverage multiple models within the chosen model suite (apache-2.0 in this case) to crowdsource opinions on data seed categories that are suitable for the provided sample of data. Note that each assistant response is automatically evaluated and some assistants may take another pass at thinking through the task (assistants 1 and 4 do this in the example above). This is a great example of feedback at play throughout Data Designer.

Third, we rank and prune the crowd-sourced categories to make sure we have a de-duplicated and ranked list of categories across all of the assistants. This is important because not all seed categories are created equal: we want to zero in on the categories that will shine.

Fourth, we generate actual values for all identified and ranked data seed categories. The goal here is to generate diverse, relevant and consistent seed values.

Lastly, we perform an orthogonality check among the various data seed categories that have been crowdsourced. Just as several people may have quite similar ideas expressed in different words, data seeds with different names and descriptions may capture the same or similar concepts. The orthogonality check aims to ensure that data seed categories explore genuinely different aspects of the dataset and that combining their values will not yield nonsensical context. The final result for the 7 requested data seeds is shown below:

🙌 Extracted 7 data seeds from the sample data:
    * category: property_condition
      values: ['new', 'used', 'renovated', 'well-maintained', 'needs repair', 'historic']
    * category: location_type
      values: ['city', 'town', 'village', 'hamlet', 'resort', 'estate']
    * category: view
      values: ['cityscape', 'countryside', 'waterfront', 'mountain', 'forest', 'panoramic']
    * category: parking_spaces
      values: [0, 1, 2, 3, 'garage', 'carport', 'off-street', 'on-street']
    * category: floor
      values: ['ground', 'first', 'second', 'third', 'fourth', 'penthouse', 'basement']
    * category: furnishing_status
      values: ['fully_furnished', 'partially_furnished', 'unfurnished', 'semi-furnished', 'furnished']
    * category: amenities
      values: ['gym', 'spa', 'private_garden', 'pool', 'jacuzzi', 'sauna', 'steam room', 'fitness center', 'yoga studio', 'meditation room', 'library', 'home theater', 'wine cellar', 'game room', 'playroom']

Note that everything above was done automatically without ever having seen this dataset. Behind the scenes, we figured out: 

  • what this dataset is about (real estate)
  • its specific application (property listings) and context (e.g., London)
  • which considerations may be important to someone considering buying a property or just examining a property listing
  • how to translate all of the above into diverse data seed categories

Quickly examining the seed categories, it’s surprising how relevant they are to the dataset without us having to do anything beyond providing a sample of data. The data seed values are also unique, non-repetitive and add important texture and variety to the data being generated. While the size of the sample dataset is important (both number of columns and number of rows) in that it’s directly related to the amount of context available to Data Designer, we have seen success with as few as 5-10 records. It all feels like magic, but is really solid compound AI system engineering.
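
To make the de-duplication and orthogonality ideas above more concrete, here is a toy heuristic that flags seed categories whose value sets overlap heavily. It is purely illustrative and is not how Data Designer performs the check; the near-duplicate category names are made up for the example.

from itertools import combinations

# Hypothetical seed categories, including a near-duplicate pair.
toy_seeds = {
    "furnishing_status": {"fully_furnished", "partially_furnished", "unfurnished"},
    "furniture_level": {"fully_furnished", "unfurnished", "semi_furnished"},
    "view": {"cityscape", "countryside", "waterfront"},
}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

# Flag pairs of categories that likely capture the same concept.
for (name_a, vals_a), (name_b, vals_b) in combinations(toy_seeds.items(), 2):
    overlap = jaccard(vals_a, vals_b)
    if overlap > 0.4:
        print(f"'{name_a}' and '{name_b}' overlap ({overlap:.0%}); consider dropping one")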

Step 2: Generating a Dataset Preview

By design, every run of the data seed extraction process will produce a different set of categories and different seed values. This allows for rapid iteration and experimentation on the user side, which is a core tenet of Data Designer. Designing data for your specific needs should be as easy as possible so that you can incorporate a data experimentation flywheel into your AI development process. Once a user is happy with what they see, they can generate a dataset preview with the following:

preview = data_designer.generate_dataset_preview(
    data_seeds=data_seeds,
    verbose_logging=True
)

In this specific example, we load the extracted data seeds, but given the modularity of Data Designer, one could combine them with other data seeds as well, be it manually provided data seeds or data seeds generated by another process.

🚀 Generating dataset preview
📥 Step 1: Load data seeds
  |-- 🌱 Loading categorical data seeds into the workflow
  |   |-- Number of seed categories: 7
🦜 Step 2: Generate dataset from sample records
  |-- 🤖 Preparing to generate a synthetic dataset based on the input sample records and categorical data seeds
  |-- 📝 Generating a description of the sample data to be used in the synthetic data generation prompt
  |-- 💬 Crafting the data generation prompt
  |-- 🎲 Randomly sampling categorical data seeds into dataset with 13 records
  |-- 🦾 Generating rich and diverse synthetic data based on the provided sample
  |   |-- Target number of records: 50
  |   |-- Number of seed permutations: 13
  |   |-- Number of records to generate per seed permutation (with buffer): 5
👀 Your dataset preview is ready for a peek!

A few things happen above. We think through how to best describe the sample dataset, based on the data sample and what we have learned about it. We then figure out how to craft and optimize the data generation prompt so that it is properly seeded. We seed the data and generate 50 preview records so that a user can examine them, have confidence in the generation process, and give a final sign-off before a much bigger dataset is produced.
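
To make the bookkeeping in the log above concrete: with a target of 50 preview records and 13 randomly sampled seed permutations, each permutation needs ceil(50 / 13) = 4 records, plus a small buffer, giving 5 records per permutation. The sketch below illustrates that arithmetic; the uniform sampling and one-record buffer are assumptions for illustration, not Data Designer’s exact logic.

import math
import random

# Toy seed categories for illustration only.
categories = {
    "property_condition": ["new", "renovated", "historic"],
    "location_type": ["city", "town", "village"],
    "view": ["cityscape", "countryside", "waterfront"],
}

target_records = 50
num_permutations = 13
buffer_per_permutation = 1  # assumed safety margin

# Randomly combine one value from each category into a seed "permutation".
permutations = [
    {name: random.choice(values) for name, values in categories.items()}
    for _ in range(num_permutations)
]

records_per_permutation = math.ceil(target_records / num_permutations) + buffer_per_permutation
print(f"Generating {records_per_permutation} records per seed permutation")  # -> 5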

To cycle through preview records, a user can make a simple call below:

preview.display_sample_record()

Again, the focus here is on rapid design, iteration, and experimentation:

Figure 2. Generated seed and data columns.

What’s interesting here is that not only are the data seeds incorporated into the entire record (e.g., see the brutalist architecture style, nearby library, and Sir Harold Gillies referenced in the description), but the formatting, style and context are aptly captured. For example, the title makes it clear that this is a listing in London. The added_on date includes not just a date, but the reduced designation present on some records as well. The description_html field does contain html tags, as in the original dataset. The price is listed in British pounds. There are many more nuances and it’s eye-opening to see all of this being produced automatically, based on a small sample of data.

Step 3: Scale Up!

Once the user is happy with the results they see, scaling up is easy using Gretel’s optimized infrastructure. A user can issue a simple request to generate a much bigger batch of data:

batch_job = data_designer.submit_batch_workflow(
   num_records=1000,
   data_seeds=data_seeds,
   project_name="sample-to-dataset"
)

In this example, we request 1,000 records generated using the data seed categories and values identified above.

🛜 Connecting to your Gretel Project:
🔗 -> https://console.gretel.ai/proj_2p7TnyBRYxHbsX0LwQChEDdWOwW
▶️ Starting your workflow run to generate 1000 records:
  |-- project_name: sample-to-dataset
  |-- project_id: proj_2p7TnyBRYxHbsX0LwQChEDdWOwW
  |-- workflow_run_id: wr_2p7ToJ5aVUPlKVJvbPzXPyKtzJk
🔗 -> https://console.gretel.ai/workflows/w_2p7ToD6VgOjHd9QHc577aRw0dZB/runs/wr_2p7ToJ5aVUPlKVJvbPzXPyKtzJk

You can see the entire sample-to-dataset workflow in your Gretel console and follow along as the dataset is being generated.
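
Once the run completes, the companion notebook shows how to pull the generated dataset back into Python for inspection. The method names below are assumptions based on common Gretel SDK patterns, not verified API; check the notebook or SDK docs for the exact calls.

# NOTE: method names here are assumptions for illustration only; see the notebook for the real API.
batch_job.wait_until_done()                                       # hypothetical: block until the workflow run finishes
synthetic_df = batch_job.fetch_dataset(wait_for_completion=True)  # hypothetical: retrieve results as a DataFrame
print(synthetic_df.shape)                                         # expect roughly (1000, number_of_columns)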

While we didn’t include it here, it’s important to emphasize that sample-to-dataset can be easily combined with all other capabilities built into Data Designer, including data validation (e.g., if you are generating code, you can bring in code validation capabilities), and a comprehensive evaluation suite complete with a Data Quality Report.

Looking Forward

At the core of the sample-to-dataset workflow is the idea that a user should be able to bring their own data and subject matter expertise to bear on the problem of generating synthetic data. Sample-to-dataset makes this easy by automatically seeding generation with categories and values that are relevant to the specific sample you provide; there is no need to specify data seeds yourself. We bring compound AI system thinking and synthetic data generation expertise to this problem and eliminate a substantial amount of grunt work for you.

With models becoming more and more capable in general, most of the value in customizing LLMs will come not from general-purpose data, but from data that is unique to your company and use case. Sample-to-dataset is just one automated approach built into our modular Data Designer architecture. We are excited to bring more techniques for tailoring synthetic data to our users’ needs, with safety and privacy in mind, in the coming months.

Give it a try!

Data Designer – true to its name – allows one to quickly design and iterate on high-quality synthetic data for a specific AI/ML task. The new sample-to-dataset workflow lets users quickly expand a handful of sample records into a rich dataset merely by showing Data Designer what the data looks like. It extends the considerable suite of tools already available to our users and brings automation that speeds up data experimentation and makes it fun!

Gretel's suite of tools offers endless opportunities to create synthetic data tailored to specific business needs, incorporating custom logic and meaningful context. We invite the community to leverage our tools to generate synthetic data and accelerate progress in your AI-driven business solutions. Stay tuned to see new use cases and datasets we create using Data Designer!

Data Designer is currently in early preview. If you would like to build datasets like this or try out other use cases, sign up to participate!

If you have questions or comments about Data Designer or the Gretel Platform, join us and 1,600 other developers, engineers, data scientists, and privacy fans in the Synthetic Data Community on Discord.