Navigator Fine Tuning is Now Generally Available

Generate domain-specific tabular datasets with multiple modalities.

Introduction 

Gretel is thrilled to announce the General Availability (GA) of Navigator Fine Tuning. This latest addition to our suite of synthetic data solutions allows users to inject their business and domain-specific knowledge into Navigator, our privacy-preserving compound AI system, by training it on a relevant real-world dataset. One of the standout features of Navigator Fine Tuning is its support for multiple tabular data modalities within a single model, including numerical, categorical, free text, and sequential (time-series) data.

During our open preview period, we saw hundreds of models trained, with thousands of minutes of API runtime, by our developer community and enterprise customers. This demonstrates clear interest in creating synthetic versions of complex enterprise datasets such as patient health records, sales data, and financial transaction histories. We have incorporated these learnings to improve the model and ensure performant results across a variety of edge cases and industry domains, delivering a seamless, scalable experience post-GA. We feel so confident in these capabilities that Navigator Fine Tuning is now the default model offered when selecting the ‘start from scratch’ blueprint on our console dashboard.

Figure 1. The Gretel Start From Scratch Blueprint.

In the sections below, we’ll provide a short summary of Navigator Fine Tuning’s capabilities, along with a few examples showing how simple it is to get started with different use cases.

Navigator Fine Tuning Overview

For those new to Navigator Fine Tuning, below are some of the key features that make it such a popular option for domain-specific synthetic data generation: 

  • Strong privacy guarantees: Navigator Fine Tuning scores well on its ability to protect sensitive entities in your training dataset, including under our new privacy metrics. 
  • Flexibility: With a single model, users can now tackle complex tabular datasets that have a range of field types, such as time-series, categorical, numerical, and free text. 
  • High-quality: Under the hood, Navigator Fine Tuning leverages the power of a pretrained transformer-based model, providing best-in-class data fidelity, as measured by our Synthetic Data Quality Report.  
  • Simplicity: Setting the configuration for Navigator Fine Tuning is very easy. The default configuration serves as an excellent starting point and is generalizable for a wide range of datasets. 
  • Console and SDK: Navigator Fine Tuning is accessible directly from our Console using its Blueprint, or with just a few lines of code using our SDK.

🏎️ Take NavFT for a Test Drive 

Below we step through four examples that showcase the flexibility and simplicity of applying Navigator Fine Tuning to a diverse range of tabular data modalities. Feel free to follow along and code with us – you can copy the code directly or use this Colab Notebook!

To get started, you will need a free Gretel key. 🔑 Next, install Gretel’s Python 🐍 client:

pip install -U gretel-client

We start with some boilerplate code that initializes a Gretel session and defines the base path to our example data:

from gretel_client import Gretel

# initialize your Gretel session
gretel = Gretel(project_name="navft-ga-examples", api_key="prompt")

# base path to dataset s3 bucket
data_path = "https://gretel-datasets.s3.us-west-2.amazonaws.com/navft_ga"

Example 1: Fine tune on numerical data

For our first example, we’ll train Navigator Fine Tuning on a dataset for modeling wine quality based on physicochemical tests. 🍷 This is a purely numerical dataset, containing floating-point and integer values.  

Sample data (url):

Figure 2. Sample wine data.

Here’s how to train a model using the default configuration:

trained_wine = gretel.submit_train(
    base_config="navigator-ft",
    data_source=f"{data_path}/uci_white_wine_quality.csv",
)

As the image below shows, Navigator Fine Tuning performed excellently on both our Synthetic Data Quality Report and privacy scores!

Side note: Our privacy score runs on all datasets, regardless of whether we expect particular fields to be private. It simulates attacks to see how vulnerable the data are, whether the subject is wine or health records. To learn more about our evaluation tools and safety measures, read the docs on our synthetic data quality scoring system.
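Gretel's actual privacy metrics are more sophisticated, but the core intuition behind an attack simulation can be sketched with a naive nearest-neighbor check: if synthetic records sit suspiciously close to individual training records, the model may have memorized them. The sketch below uses toy data and is purely illustrative, not Gretel's implementation:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Toy "training" data: 100 records with 3 numerical fields
train = rng.normal(loc=0.0, scale=1.0, size=(100, 3))

# Case 1: synthetic data drawn from the same distribution (desirable)
synth_ok = rng.normal(loc=0.0, scale=1.0, size=(100, 3))

# Case 2: synthetic data that essentially memorized the training rows (leaky)
synth_leaky = train + rng.normal(scale=0.001, size=train.shape)

def mean_nn_distance(synthetic, training):
    """Mean distance from each synthetic record to its nearest training record."""
    # Pairwise Euclidean distances, shape (n_synth, n_train)
    dists = np.linalg.norm(synthetic[:, None, :] - training[None, :, :], axis=-1)
    return dists.min(axis=1).mean()

print(mean_nn_distance(synth_ok, train))     # comfortably large
print(mean_nn_distance(synth_leaky, train))  # near zero: likely memorization
```

A near-zero mean nearest-neighbor distance is a red flag that synthetic records are close copies of real ones; real privacy evaluations combine many such signals.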

Figure 3. Synthetic data quality scores for generated wine data. Source: Wine SQS Report

Example 2: Fine tune on categorical data

Next, let’s see how we can fine tune the model on a census income dataset, which primarily consists of categorical integers and strings. 

Sample data (url):

Figure 4. Sample of census income data.
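Navigator Fine Tuning infers field types automatically, so no manual labeling is needed. Still, it can be useful to sanity-check your own dataset before training. This sketch uses a toy census-style table (illustrative columns, not the actual UCI schema) and a simple dtype heuristic:

```python
import pandas as pd

# Toy stand-in for a census-style table (illustrative rows and columns only)
df = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "workclass": ["State-gov", "Self-emp", "Private", "Private", "Private"],
    "education_num": [13, 13, 9, 7, 13],
    "income": ["<=50K", "<=50K", "<=50K", "<=50K", ">50K"],
})

# Simple heuristic: string (object) columns are categorical, the rest numerical
kinds = {
    col: ("categorical" if df[col].dtype == object else "numerical")
    for col in df.columns
}
for col, kind in kinds.items():
    print(f"{col}: {df[col].dtype}, {df[col].nunique()} unique values -> {kind}")
```

Note that integer-coded categories (common in census data) would need a cardinality check on top of this dtype heuristic; the model handles both cases for you.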

We only need to change the path to the data source. We can use the same default configuration:

trained_income = gretel.submit_train(
    base_config="navigator-ft",
    data_source=f"{data_path}/uci_adult_income.csv",
)

Our synthetic data quality and privacy metrics again highlight an excellent performance by Navigator Fine Tuning:

Figure 5. Synthetic data quality scores for generated census data. Source: Census SQS Report

Example 3: Fine tune on free text data

As an LLM-based model, Navigator Fine Tuning also naturally supports free text data. This example uses a dataset for evaluating the performance of intent classification systems in the presence of "out-of-scope" queries. 

Sample data (url)

Figure 6. Sample intent classification system data.

Again, we can use the same default training configuration. We just need to change the path to the data source:

trained_clinc = gretel.submit_train(
    base_config="navigator-ft",
    data_source=f"{data_path}/clinc_oos.csv",
)

To compare the synthetic text with the real data, we can use Gretel's Synthetic Text Data Quality Report. The figure below shows that the structure of the synthetic and real text is nicely aligned.
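A rough, home-grown version of that structural comparison is to look at word-count distributions in the real and synthetic corpora. The strings below are toy examples (not the actual dataset), and this is not the report's methodology, just the underlying idea:

```python
from statistics import mean

# Toy intent-style queries (illustrative only)
real_texts = [
    "how do i reset my password",
    "what's the weather like tomorrow",
    "transfer twenty dollars to savings",
]
synthetic_texts = [
    "can you reset my email password",
    "will it rain in the city tomorrow",
    "move fifty dollars into checking",
]

def word_count_stats(texts):
    """Summarize the word-count distribution of a list of texts."""
    counts = [len(t.split()) for t in texts]
    return {"mean_words": mean(counts), "min": min(counts), "max": max(counts)}

print("real:     ", word_count_stats(real_texts))
print("synthetic:", word_count_stats(synthetic_texts))
```

If the two distributions diverge sharply, the synthetic text is unlikely to be a faithful structural match for the original.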

Figure 7. Comparisons of original and synthesized data. Source: Text SQS Report.

Example 4: Fine tune on a mix of tabular modalities

Finally, using the same configuration as the above examples, Navigator Fine Tuning can seamlessly support tables that contain a mix of numerical, categorical, and free text data. In this example, we use a synthetic financial transaction dataset, which we generated from scratch with Gretel Navigator. 
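Concretely, a single mixed-modality table combines all of these field types side by side. The toy rows below are hypothetical, purely to illustrate the shape of such a dataset:

```python
import pandas as pd

# One table mixing the modalities Navigator Fine Tuning handles:
# a timestamp (sequential), a numerical, a categorical, and a free text field.
transactions = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-05 09:12", "2024-01-05 11:47", "2024-01-06 14:03"]
    ),
    "amount": [42.50, 18.99, 250.00],             # numerical
    "category": ["groceries", "dining", "rent"],  # categorical
    "memo": [                                     # free text
        "weekly grocery run at the corner store",
        "lunch with the design team",
        "january rent payment",
    ],
})

print(transactions.dtypes)
```

A single fine-tuned model learns the joint distribution across all of these columns at once, so correlations between, say, category and memo are preserved.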

Sample data (url)

Figure 8. Sample of financial transaction data.

Once again, we only need to change the path to the data source:

trained_transactions = gretel.submit_train(
    base_config="navigator-ft",
    data_source=f"{data_path}/navigator_financial_transactions.csv",
)

Figure 9. Synthetic data quality scores for generated financial transaction data. Source: Transactions SQS Report

Conclusion

We're excited to add fine-tuning capabilities to our privacy-preserving compound AI system, Navigator, and expand ways for developers to safely design tailor-made data solutions to their AI projects. We hope these workflows give you a sense of what's possible, and spark ideas for how you can use Navigator Fine Tuning in your own AI development.

If you have any questions about Gretel or how to use Navigator Fine Tuning, join our Discord community for instant access to expert advice from the Gretel team! If you're interested in developing your synthesizing skills further, Gretel University is also now live—your ultimate hub for mastering synthetic data, where you’ll find insightful videos and our curated list of expert resources.

Go forth and synthesize.