Filling in sparse tables with Gretel’s Tabular LLM
Sparse or incomplete data is one of the most common bottlenecks to scaling effective data pipelines. Often it is not the absolute lack of data, but the lack of consistently high-quality data, that impedes data science and ML workflows. The generative capabilities of large language models (LLMs) appear to be a promising solution.
LLMs are trained to create contextually relevant and sometimes even novel information. The rapid rise of LLMs for natural language generation has shown that these models can produce high-quality responses to contextual prompts. At Gretel, this inspired a question: can we extend these approaches to solve for the intersections of attention and context that are present in tabular datasets?
Working with tabular data introduces a unique set of challenges. Tabular datasets often represent a complex set of relationships, with each cell in a table holding multiple relationships with adjacent cells as well as at the row and column level. It is hard to maintain the table's structured format in synthetically generated data while also reproducing the field- and row-level correlations across numerical, categorical, and free-text data types. That is why standard LLMs struggle to produce high-fidelity tabular datasets at scale (Levin et al., 2022). The task demands a tailor-made solution, built from the ground up to work specifically with tabular data. In this blog, we’ll explore one approach using Gretel’s Tabular LLM.
Synthesizing Product Listings for E-Commerce
In this notebook and video walkthrough, we will use Gretel’s Tabular LLM to generate contextually relevant values for a sparse tabular dataset. Our dataset will be product listings for shoes, which presents multiple data constraints that we will want reflected in our synthetic insertions. For instance, our generated data should reflect the fact that US shoes come in a small range of whole and half sizes. Product descriptions should be additive and relevant to the given product, and so should any color options listed. In our walkthrough we will contextually fill in these missing values, while respecting data validity constraints and providing relevant results that can be validated by a human observer (you!).
For our walkthrough we will work with a row-wise representation of tabular data. Let’s first consider what is needed to effectively complete a missing field. The first constraints are the expectations of the data encoded by that column. These column-wise constraints will vary based on the type of data in that column. The synthetically generated field will also need to cohere with the observations encoded across its row: it has to make sense alongside those other observations. When working with Gretel models, we often evaluate how the aggregated results (including synthetic additions) maintain the statistical distribution of the original dataset, and the cross-correlations between the fields in the tabular dataset. In short, there are column-wise, row-wise, and distribution-wise contexts to maintain and operate within.
With all this additional context in the sparse tabular use case, we will get higher quality results from an LLM that was purpose-built to create and interact with tabular data.
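To make the three kinds of context concrete, here is a minimal sketch of what such checks could look like for a shoe listing. It is not taken from the notebook: the column names (`size`, `color`, `description`), the size range, and the color-in-description heuristic are all assumptions chosen for illustration, and the code uses only the Python standard library rather than any Gretel API.

```python
from collections import Counter

# Assumed limits for illustration only; the real valid range depends on the catalog.
MIN_SIZE, MAX_SIZE = 4.0, 16.0


def valid_us_size(value):
    """Column-wise check: a whole or half size inside the assumed range."""
    try:
        size = float(value)
    except (TypeError, ValueError):
        return False
    # Doubling a whole or half size always gives an integer (9.5 -> 19.0).
    return MIN_SIZE <= size <= MAX_SIZE and (size * 2).is_integer()


def row_is_coherent(row):
    """Row-wise check (a crude heuristic): the color is mentioned in the description."""
    color = (row.get("color") or "").strip().lower()
    description = (row.get("description") or "").lower()
    return bool(color) and color in description


def category_shares(rows, column):
    """Distribution-wise view: share of each non-missing value in a column."""
    counts = Counter(r[column] for r in rows if r.get(column))
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


if __name__ == "__main__":
    # Hypothetical completed row, used only to exercise the checks.
    row = {
        "name": "Trail Runner",
        "size": "9.5",
        "color": "Red",
        "description": "A red trail shoe with a cushioned sole.",
    }
    print(valid_us_size(row["size"]), row_is_coherent(row))
```

Comparing `category_shares` for the original rows against the rows with synthetic additions gives a rough view of whether a categorical column has drifted; a real evaluation would likely use richer statistics than this.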
Before diving into the notebook, there are a couple of prerequisites:
- Gretel API Key: You'll need an API key from Gretel. If you don't have one already, you can obtain it from Gretel's console. This key will enable us to use Gretel's services for generating our synthetic datasets.
- Access to Gretel's Tabular LLM: To utilize Tabular LLM, sign up at Gretel's Tabular LLM page to get access immediately.
Let's get started!
Part 2 - Scaling Prompt Engineering
In a tabular context, each of these possible constraints may become a reason to reject our synthetic results. When results don’t meet our expectations, prompt engineering becomes a way to iterate toward a better result. Prompt engineering is a popular way to drive better results from an LLM, but it is often a manual, labor-intensive process. For enterprise applications, automation needs to be built into the data pipeline in order to scale. In a follow-up post, we’ll explore programmatic prompt engineering techniques that enable users to validate micro batches of data as they go, while still quickly generating and processing a bulk dataset.
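As a rough illustration of validating micro batches as they are generated, the sketch below fills rows in small batches and retries only the rows that fail validation. It is an assumption about how such a loop might be structured, not the technique from the follow-up post: `generate` and `is_valid` are hypothetical callables supplied by the caller, and `generate` stands in for whatever model call a pipeline actually uses.

```python
def fill_in_batches(rows, generate, is_valid, batch_size=10, max_attempts=3):
    """Fill sparse rows in small batches, retrying only rows that fail validation.

    `generate` (hypothetical) takes a list of sparse rows and returns the same
    number of completed rows, in the same order. `is_valid` (hypothetical)
    takes one completed row and returns True or False.
    """
    accepted, rejected = [], []
    for start in range(0, len(rows), batch_size):
        pending = rows[start:start + batch_size]
        for _ in range(max_attempts):
            if not pending:
                break
            completed = generate(pending)
            retry = []
            for original, candidate in zip(pending, completed):
                if is_valid(candidate):
                    accepted.append(candidate)
                else:
                    # Keep the original sparse row so it can be regenerated.
                    retry.append(original)
            pending = retry
        # Rows still failing after max_attempts are set aside for review.
        rejected.extend(pending)
    return accepted, rejected
```

Returning the rejected rows separately keeps a human in the loop for the cases the automated checks could not resolve, without blocking the rest of the bulk dataset.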
We recently made access to Gretel's Tabular LLM early preview available to all users, so if you’re interested in testing this example or experimenting with your own applications, you can sign up and start immediately. If you have specific use cases you’d like us to explore in this series, drop us a note in the Synthetic Data Discord Community or email us at email@example.com.