Fine-Tuning Gretel Navigator To Generate the Highest-Quality Domain-Specific Synthetic Data
We are excited to announce the public preview of Navigator Fine Tuning, the latest advancement in our suite of synthetic data solutions. This new feature builds upon the recent general availability of Gretel Navigator, letting you generate data not only from a prompt, but also by fine-tuning the underlying model on your domain-specific, real-world datasets to produce the highest-quality synthetic data.
Recap: What is Gretel Navigator?
Gretel Navigator is our innovative product that tackles the cold-start problem head-on. It leverages a compound AI system, agentic workflows, and pre-trained large language models (LLMs) to generate robust tabular datasets from user prompts, transforming raw concepts into ready-to-use, task-specific data. Whether you're launching a new project or need to augment existing datasets, Navigator has become an indispensable tool for data scientists, machine learning practitioners, and enterprise data teams, including our launch partners Databricks, EY, Google, and Microsoft.
Navigator Fine Tuning: Tailoring Data to Your Needs
While Navigator already excels at creating diverse and useful synthetic datasets from mere prompts, we understand that prompting alone may result in somewhat generic data that does not always meet the precision requirements of specific business contexts or domains. That's where Navigator Fine Tuning comes into play.
Fine-tuning allows users to inject their business and domain-specific knowledge into Navigator by training it on a relevant real-world dataset. Similar to our other models like ACTGAN and Tabular DP, fine-tuning enables Navigator to learn the specific characteristics, correlations, and distributions within your data, generating synthetic versions that maintain the original nuances while maximizing privacy protections.
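For readers who prefer to start from code, here is a minimal sketch of what a fine-tuning job could look like with the high-level Gretel Python client. The configuration name (`navigator-ft`) and file path are assumptions for illustration; check the docs for the exact identifiers.

```python
# Hypothetical sketch: fine-tune Navigator on a domain-specific CSV using the
# high-level Gretel Python client. Config name and file path are assumptions.
from gretel_client import Gretel

gretel = Gretel(api_key="prompt")  # prompts for your API key if not already configured

# Submit a fine-tuning job on your real-world dataset
trained = gretel.submit_train(
    "navigator-ft",                     # assumed blueprint/config name for Navigator Fine Tuning
    data_source="patient_events.csv",   # your domain-specific training data
)
```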
Unmatched Versatility Across Multiple Modalities
One of the standout features of Navigator Fine Tuning is its support for multiple tabular data modalities within a single model. This means you can now generate datasets that maintain correlations across:
- Numeric Data: Continuous or discrete numbers
- Categorical Data: Categories or labels
- Free Text: Unstructured text entries and long-form natural language such as email messages or notes in medical treatment summaries
- Time Series: Sequential time-stamped data
- JSON Data: Complex nested structures
All these data types can coexist within a single dataset, maintaining correlations not just within individual rows, but also across events spanning multiple rows, making Navigator an exceptionally powerful tool for time series data generation.
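To make this concrete, here is an illustrative example of a single training file that mixes all of the modalities above. The column names and values are invented purely for this example.

```python
# Illustrative only: one dataset combining numeric, categorical, free-text,
# time-series, and JSON columns. Column names and values are made up.
import json
import pandas as pd

df = pd.DataFrame(
    {
        "patient_id": ["P-001", "P-001", "P-002"],                 # categorical identifier
        "event_time": pd.to_datetime(                              # time-stamped events
            ["2024-01-03 09:15", "2024-01-03 14:40", "2024-01-05 11:05"]
        ),
        "event_type": ["admission", "prescription", "lab_test"],   # categorical
        "heart_rate": [88, 84, 72],                                 # numeric
        "note": [                                                   # free text
            "Patient admitted with chest pain and shortness of breath.",
            "Started on low-dose aspirin; monitor for GI side effects.",
            "CBC ordered; results pending.",
        ],
        "metadata": [                                               # JSON (nested structure)
            json.dumps({"ward": "cardiology", "bed": 12}),
            json.dumps({"drug": "aspirin", "dose_mg": 81}),
            json.dumps({"panel": "CBC", "priority": "routine"}),
        ],
    }
)

df.to_csv("patient_events.csv", index=False)
```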
Example Use Case: Healthcare Data
To illustrate the power and flexibility of Navigator Fine Tuning, let's consider a healthcare dataset consisting of patient events including hospital visits, tests, diagnoses, prescriptions, and doctors' notes. We started from a public dataset of patient case descriptions, and prompted Navigator to turn them into structured data capturing a series of realistic synthetic events based on the unstructured descriptions.
Next, we trained a Navigator Fine Tuning model on this dataset, with the goal of generating thousands of additional records cost-effectively while maintaining the original dataset's correlations and distributions. Below are some samples from the fine-tuned model:
After fine-tuning for a few minutes, our model captured the essence of the training data without having been explicitly programmed to do so. Patient names and dates of birth within a single patient ID match, events follow a predictable sequence (e.g., hospital admission, then treatment, then discharge), and even the prescribed medications (including several that did not exist in the fine-tuning data) are real drug names routinely prescribed for the conditions listed in the same records.
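Once a fine-tuned model is available, generating the additional records described above might look like the following sketch. It again assumes the high-level Gretel Python client; the model ID placeholder and exact method and parameter names should be checked against the SDK docs.

```python
# Hypothetical sketch: generate additional records from a fine-tuned Navigator model.
# The model ID is a placeholder; method and parameter names are assumptions.
from gretel_client import Gretel

gretel = Gretel(api_key="prompt")

model_id = "your-fine-tuned-model-id"  # returned by the earlier training job

# Request additional synthetic records from the fine-tuned model
generated = gretel.submit_generate(model_id, num_records=5000)

# The synthetic records are returned as a pandas DataFrame
print(generated.synthetic_data.head())
```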
What's Next?
We are excited to see how you leverage Navigator Fine Tuning to solve your unique data challenges. Get started today with the Gretel Console blueprint or our SDK notebook, and don't forget to check out our docs.
We are continually honing our technology to stay at the cutting edge of synthetic data generation. Our next big milestone includes adding support for differentially private fine-tuning. This will further enhance the privacy guarantees of the synthetic data generated, ensuring that even highly sensitive datasets can be synthesized while minimizing risks of exposing any individual data points.
Happy synthesizing!