Build high-quality datasets for AI using Gretel Navigator Data Designer
Gretel Navigator Data Designer: Revolutionizing Synthetic Data Creation for Smart AI Development
Generate production-ready synthetic datasets in hours, not weeks. Gretel Navigator's new Data Designer tool eliminates the complexity of synthetic data creation, helping AI teams quickly build high-quality training datasets without compromising sensitive information.
With Data Designer, you get:
- Speed: Generate preview datasets in minutes, production datasets in hours
- Quality: Built-in best practices and evaluation metrics ensure accuracy, diversity, and relevance
- Simplicity: Automated workflows replace complex manual processes
- Scale: Move from proof-of-concept to production without rebuilding
- Data-centric AI: Unlock true data experimentation with rapid iteration on use-case-specific data
Why Synthetic Data Matters
In today's AI-driven world, access to high-quality training data is often limited by privacy concerns, scarcity, or high acquisition costs. While LLMs excel at general tasks, they often struggle with specific domains where training data is limited or contains sensitive information. Data Designer bridges this gap, enabling teams to create rich, diverse datasets without compromising sensitive information.
Imagine requesting "Create a Python script to analyze transaction patterns for fraud detection" and instantly receiving production-ready code. This is now possible through synthetic datasets built with Data Designer. One of our customers reduced their dataset creation time from 3 weeks to just 4 hours, enabling them to rapidly fine-tune a small language model to assist with financial analysis tasks.
How Data Designer Works
Data Designer simplifies dataset generation through an intuitive three-step workflow, sketched in code just after this list:
- Define Your Needs: Specify the type of data you need using our simple configuration interface
- Preview & Iterate: Generate sample datasets in minutes to validate your approach
- Scale Up: Move to production-scale generation with built-in quality controls
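For orientation, here is roughly how those steps look in code using the SDK shown later in this post. The import path is an assumption based on the early-preview client, and the toy config is illustrative; a complete configuration example appears in the next section.
from gretel_client.navigator import DataDesigner  # import path assumed (early preview)

# Step 1 - Define: describe the dataset declaratively in a YAML config
# (this toy config is illustrative; a full example appears below).
config_string = """
model_suite: apache-2.0
generated_data_columns:
  - name: text
    generation_prompt: Write a short Python programming question.
"""

# Step 2 - Preview: generate a small sample in minutes to validate the approach.
data_designer = DataDesigner.from_config(config_string)
preview = data_designer.generate_dataset_preview()

# Step 3 - Scale: submit a batch job with built-in quality controls
# (see the batch example near the end of this post).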
Gretel’s compound AI approach ensures users get high-quality, diverse data at scale, along with detailed evaluation reports that eliminate guesswork about dataset quality. Data Designer can build multi-turn chat conversations to train chatbots, improve the reasoning ability of models (blog), enhance RAG systems by training models on specific domains, improve a model’s ability to solve code generation tasks, evaluate models post-training, and more!
Build your Dataset
Data Designer simplifies dataset generation by introducing a few core concepts that make the process more intuitive and efficient for users:
1. Model Suites: Curated collections of models designed to navigate the challenges of model selection, regulatory compliance, licensing, and legal rights over generated data. As new LLMs are released, Gretel’s model suites ensure customers benefit from state-of-the-art models and the best data quality. Model suites optimize the synthetic data generation process for a customer's needs: each suite is carefully assembled to provide high-quality data generation while adhering to specific licensing and usage requirements.
2. Data Seeds: Data seeds guide Navigator to generate data about the topics most important to the task. Users specify topics of interest to inject the diversity they need to get the most out of the models they are training. Seeds can be key-value pairs, for example “Industry” → “Healthcare”, or even complete records that represent the type of dataset you want to create.
3. Generation Prompts: For each column of data to be generated, users provide a detailed prompt that guides Navigator's output for that column. Data seeds can be referenced inside generation prompts to inject the diversity a user wants in their dataset.
4. Post Processors: Users configure how their generated data should be validated and evaluated. When generating high-quality code, customers may have rules that generated columns must follow: for example, in a dataset of Text-to-Python examples, the code column should be validated to ensure it contains syntactically correct Python. Similarly, a column may need to follow a specific distribution, contain only numbers, and so on. Customers can also configure how the dataset is evaluated: Navigator uses sophisticated LLMs to score the generated data and produces a comprehensive metric report that helps users understand whether their data is high quality.
5. Data Designer Configuration: The primary interface developers use to describe the datasets they want to generate and to control the generation process. All of the concepts above are expressed through a Data Designer Configuration, giving users a declarative way to build their datasets. For example:
model_suite: apache-2.0

special_system_instructions: >
  You are an expert at writing, analyzing, and editing Python code.
  Your job is to assist the user with their Python-related tasks.

categorical_seed_columns:
  - name: industry_sector
    values:
      - Healthcare
    subcategories:
      - name: topic
        values:
          Healthcare:
            - Electronic Health Records (EHR) Systems

generated_data_columns:
  - name: text
    generation_prompt: >
      Write a prompt for a text-to-code dataset that
      is related to {topic} in the {industry_sector}
      sector.
  - name: code
    generation_prompt: >
      Write Python code that will be paired with the
      following prompt: {text}

post_processors:
  - validator: code
    settings:
      code_lang: python
      code_columns: [code]
  - evaluator: text_to_python
    settings:
      text_column: text
      code_column: code
6. Workflow: A data generation pipeline comprising configurable tasks executed by Data Designer. A Workflow encapsulates the compound AI system interactions, the tool/model selection for each task in the pipeline, and synthetic data generation best practices. Workflows can be executed to generate quick data previews or to generate data at scale using Gretel-managed compute.
7. Blueprints: Data Designer Configurations built by Gretel to solve specific use cases, so users don’t need to spend time iterating on their configurations. The first Blueprints we released build Text-to-Python and Text-to-SQL datasets; a loading sketch follows below.
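Because Blueprints are themselves Data Designer Configurations, one plausible usage pattern is to fetch the published YAML and hand it to from_config. The URL below is a placeholder, not an official endpoint, and the import path is assumed from the early-preview client.
# Load a published Blueprint as a Data Designer configuration.
import requests

from gretel_client.navigator import DataDesigner  # import path assumed (early preview)

# Hypothetical URL; substitute the location where Gretel publishes Blueprints.
blueprint_url = "https://example.com/blueprints/text_to_python.yaml"
config_string = requests.get(blueprint_url, timeout=30).text

data_designer = DataDesigner.from_config(config_string)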
Understanding your Dataset
Once a user creates a Data Designer configuration, they can generate a preview of their dataset to quickly check whether the configuration is right. Data Designer’s quick previews let users easily experiment, verify a particular approach, fail fast, and be far more agile in generating data that raises the bar. Combined with simple YAML configuration, validation, evaluation, reproducibility, and other Data Designer capabilities, previews unlock a true data experimentation flywheel for anyone working with generative AI.
# Instantiate Data Designer from the YAML config above; the import path
# below reflects the early-preview SDK and may differ in later releases.
from gretel_client.navigator import DataDesigner

data_designer = DataDesigner.from_config(config_string, **session_kwargs)

# Generate a small preview dataset in minutes to validate the configuration.
preview = data_designer.generate_dataset_preview()
Users can also inspect individual records for more detailed insights.
# Display a single record from the preview for closer inspection.
preview.display_sample_record()
Once you’ve iterated on your configuration by inspecting data previews, you can submit a batch job to generate data at scale on Gretel’s cloud platform. At the end of the batch job, Gretel produces a comprehensive data quality report with a detailed breakdown of the dataset, including a data diversity analysis and LLM-based quality assessments.
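For concreteness, here is a minimal sketch of that scale-up step. The method names (submit_batch_workflow, fetch_dataset, download_evaluation_report) are assumptions based on the early-preview SDK and may differ in your client version.
# Submit a batch job to generate data at scale on Gretel's cloud platform.
# Method and argument names below are assumptions from the early-preview SDK.
batch_job = data_designer.submit_batch_workflow(num_records=25_000)

# Block until the job finishes and fetch the generated dataset as a DataFrame.
df = batch_job.fetch_dataset(wait_for_completion=True)

# The data quality report (diversity analysis, LLM-based assessments) is
# produced alongside the dataset at the end of the job.
report_path = batch_job.download_evaluation_report()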
Driving FinTech Innovation with a High-Quality Text-to-Python Dataset
One of the first use cases we explored with Data Designer was building high-quality Text-to-Code datasets. To illustrate this, we’ve created a Text-to-Python dataset tailored to the FinTech industry. What if you could ask an AI chatbot to give you a Python script to find patterns that indicate fraudulent transactions and instantly receive the code you need? A capability like this democratizes access to advanced financial analytics, empowering analysts and other professionals to implement their ideas without extensive coding skills.
However, for LLMs to generate accurate and contextually relevant code, they need to be trained on specialized datasets. Existing datasets often lack the specific terminology and scenarios unique to FinTech, limiting the models' effectiveness in real-world applications. By fine-tuning LLMs with our dataset, we enhance their ability to produce precise and relevant code, accelerating innovation across the industry.
To demonstrate Data Designer's capabilities, we've published a production-ready FinTech dataset on Hugging Face: gretelai/gretel-text-to-python-fintech-en-v1. This dataset contains over 25,000 text-to-code pairs focused on financial analysis tasks, all Apache 2.0 licensed for commercial use. You can use it directly to train your models or as a template to understand how to generate your own custom datasets with Data Designer.
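To pull the published dataset down directly, one option is the Hugging Face datasets library. The repository ID matches the link above, and the column names ("text", "code") are assumed to follow the configuration pattern shown earlier, so check the dataset card before relying on them.
# Load Gretel's Text-to-Python FinTech dataset from the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("gretelai/gretel-text-to-python-fintech-en-v1", split="train")

# Each record pairs a natural-language prompt with Python code; the column
# names here are assumed from the config pattern earlier in this post.
record = ds[0]
print(record["text"])
print(record["code"])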
Conclusion
Gretel's suite of tools offers endless opportunities to create synthetic data tailored to specific business needs, incorporating custom logic and meaningful context. We invite the community to leverage our tools to generate synthetic data and accelerate progress in your AI-driven business solutions. Stay tuned to see new use cases and datasets we create using Data Designer!
Data Designer is currently in early preview. If you would like to build datasets like this or try out other use cases, sign up here to participate!
If you have questions or comments about Data Designer or the Gretel Platform, join us and 1,600 other developers, engineers, data scientists, and privacy fans in the Synthetic Data Community on Discord.