Gretel Smart-Seeding is auto-complete for your data

Smart-seeding lets you train a synthetic data model to auto-complete partial records and text.

Over the last few weeks, we have released how-to guides on two particularly challenging use cases for synthetic data: removing bias from datasets and generating synthetic time-series data. Both use cases are made possible by a feature we call smart-seeding: the ability to use the synthetic model’s neural network to auto-complete a partial record.

In a tabular dataset, this is essentially “seeding” a synthetic model with several columns for a record in the same format it was trained on, and then asking the model to complete the rest of the record. Today we are releasing two new Blueprints that utilize Gretel's smart-seeding and provide interfaces that automate this process for you.

The use cases for smart-seeding we have observed are:

  • Recreating time series data with the ability to learn seasonal patterns in data.
  • Reducing biases in ML datasets by equalizing the representation of minority classes in data.
  • Creating synthetic data that has the same shape as the original data, while preserving vital portions of each row in the table. This might include preserving primary and foreign keys, tightly coupled location data, or other categorical and continuous values that are critical to downstream use cases.

Let's take a look at the new blueprints.

Time Series Data

This blueprint automates the logic to train synthetic time-series models on trends in the data rather than on absolute values, so you don't have to derive the trend data from the source dataset yourself. This is particularly useful for training synthetic models on financial datasets with open-high-low-close data. All you need to do is provide your time/date column and the column(s) that represent the observations at those points in time.

As a note, you may specify more than one "trend" column here. The library will automatically compute the trend lines for you and convert the synthetic trends back into values after generating the new DataFrame.
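To make the idea concrete, here is a minimal pandas sketch of what "training on trends" can look like. It is an illustration under our own assumptions, not the blueprint's actual code: the helper names `to_trend` and `from_trend` are hypothetical, and we assume a "trend" is simply the row-to-row change in each trend column.

```python
import pandas as pd

def to_trend(df, date_col, trend_cols):
    """Replace absolute values with row-to-row changes (assumed definition of a trend)."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    # Keep the first observation so the absolute values can be rebuilt later.
    start = ordered[trend_cols].iloc[0]
    trend = ordered.copy()
    # The first row has no previous row, so its change is set to 0.
    trend[trend_cols] = ordered[trend_cols].diff().fillna(0.0)
    return trend, start

def from_trend(trend, start, trend_cols):
    """Rebuild absolute values by accumulating the changes from the starting point."""
    restored = trend.copy()
    restored[trend_cols] = trend[trend_cols].cumsum() + start
    return restored

# Hypothetical open/close observations.
prices = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=4),
    "open": [10.0, 11.0, 9.5, 12.0],
    "close": [10.5, 10.0, 11.0, 12.5],
})

trend, start = to_trend(prices, "date", ["open", "close"])
restored = from_trend(trend, start, ["open", "close"])
```

In a real workflow, the model would be trained on the `trend` DataFrame, and `from_trend` would be applied to the synthetic output rather than to the training data.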

General Smart-Seeding

This blueprint utilizes a new model that allows you to take partial values from your training DataFrame and use them as input for synthesis. You do this by specifying one or more seed columns. For instance, if you have a DataFrame with 5,000 rows and choose columns A, B, and C, the model will be trained with those columns as "smart seeds."

When generating data, the 5,000 3-tuple values of A, B, C will be used as input to the model, and the model will synthesize the rest of the record for you. This generates a 1:1 mapping of training data to synthetic data.
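The 1:1 mapping is easier to see with a toy example. The sketch below is not Gretel's model: `complete_from_seeds` is a hypothetical stand-in that "completes" each record by sampling the remaining fields from training rows sharing the first seed value, where the real feature uses a trained neural network. What it does show is the shape of the process: one seed tuple goes in per training row, and one completed record comes out with the seed values preserved.

```python
import pandas as pd

def complete_from_seeds(train, seed_cols, random_state=0):
    """Toy stand-in for a seeded model (assumption, not the Gretel API)."""
    rest_cols = [c for c in train.columns if c not in seed_cols]
    seeds = train[seed_cols].reset_index(drop=True)
    rows = []
    for i, seed in enumerate(seeds.itertuples(index=False)):
        # Candidate rows: training records that share the first seed column's value.
        pool = train[train[seed_cols[0]] == seed[0]]
        # Draw the non-seed fields from one candidate row.
        drawn = pool[rest_cols].sample(n=1, random_state=random_state + i).iloc[0]
        rows.append({**dict(zip(seed_cols, seed)), **drawn.to_dict()})
    return pd.DataFrame(rows, columns=list(train.columns))

# Hypothetical training data; "state" and "zip" act as the seed columns.
train = pd.DataFrame({
    "state": ["CA", "CA", "NY", "NY", "CA"],
    "zip": ["94107", "94110", "10001", "10002", "94107"],
    "plan": ["basic", "pro", "pro", "basic", "pro"],
    "monthly_spend": [20.0, 55.0, 60.0, 18.0, 52.0],
})
seed_cols = ["state", "zip"]

synthetic = complete_from_seeds(train, seed_cols)
```

Every row of `synthetic` carries the same seed values as the training row in the same position, while the remaining columns are generated.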

Optionally, you may enable an auto-correlation feature for smart seeding. For example, assume you have a numerical column called "age" that you want to use as a smart seed, and a categorical column called "age_group" that is derived from "age". Because "age_group" is so tightly linked to "age", the model will automatically include "age_group" as a smart seed.

Summary

We're thrilled to get such good feedback from our users. If these use cases are familiar to you and you are exploring synthetic data, dive into the Gretel Beta. It's free!

As always, we welcome feedback via email (hi@gretel.ai), or you can find us on Slack!