Gretel Synthetics: Introducing v0.10.0
Today we are thrilled to release new features for gretel-synthetics that make it even easier to get started with synthetic data. After initial rounds of testing and feedback from customers, we are introducing a new interface that works directly with pandas DataFrames: it trains on batches of columns and exchanges your source DataFrame for a synthetic DataFrame of the same shape.
Got DataFrames? Want synthetic data? It is time.
The Batch interface uses the existing config, training, and generation modules under the hood while automating some of the manual steps a user would have to take to work with complex and highly dimensional datasets.
The motivation behind the Batch interface is to support datasets with high dimensionality and column counts. This is achieved by clustering like columns and training models on subsets of the entire dataset at once. While batching is generally not required to generate synthetic datasets with fewer than 30–40 columns, scaling to support datasets with hundreds or thousands of columns requires a different approach.
One of the biggest benefits to Gretel’s synthetic data library is its ability to automatically learn and maintain correlations and statistical relationships between columns at both the dataset and record levels. For example, consider height and weight. Taller people tend to be heavier — that’s an example of a positive correlation, which would be learned automatically by our model. Not all columns are directly correlated, however, and by splitting the input training into batches or clusters of columns, it becomes possible to scale synthetic generation to highly dimensional datasets with minimal loss in accuracy.
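To make the height and weight example concrete, here is a quick pandas sketch (the numbers are made up for illustration) showing the kind of positive correlation a synthetic model should preserve:

```python
import pandas as pd

# Toy data: taller people tend to be heavier (values are invented).
df = pd.DataFrame({
    "height_cm": [152, 160, 168, 175, 183, 191],
    "weight_kg": [50, 58, 65, 72, 81, 90],
})

# A Pearson correlation near +1.0 indicates a strong positive relationship.
# A good synthetic dataset should reproduce a similar correlation.
correlation = df["height_cm"].corr(df["weight_kg"])
print(round(correlation, 3))
```

Measuring correlations like this on both the source and synthetic DataFrames is a simple way to spot-check how well relationships were preserved.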
Let’s explore how to use the Batch interface. Click the link below to build a synthetic dataset interactively with our batch training notebook (including free GPU access 🔥) on Google Colaboratory.
First, define parameters for training Gretel’s synthetic data model:
Gretel Synthetics’ DataFrameBatch object requires only two keyword arguments: your source pandas DataFrame and a configuration template. The parameters for the configuration template are exactly the same as the parameters you would use for a regular synthetic configuration; simply put them into a dictionary instead. Version 0.10.0 also includes detailed docstrings for each of the config parameters so you can explore and tune for more sophisticated use cases.
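For illustration, a configuration template might look like the dictionary below. The keys mirror standard synthetic config parameters; the values are placeholders for a toy run, not tuned recommendations:

```python
# Illustrative configuration template. These are the same parameters a
# regular synthetic config accepts, expressed as a plain dictionary.
config_template = {
    "epochs": 15,                     # training epochs per batch model
    "vocab_size": 20000,              # tokenizer vocabulary size
    "field_delimiter": ",",           # required in Batch mode
    "overwrite": True,                # overwrite any existing checkpoints
    "checkpoint_dir": "checkpoints",  # batch_N sub-dirs are created here
}
```

Pass this dictionary, along with your source DataFrame, as the two keyword arguments to DataFrameBatch.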
By default we split the input DataFrame into batches of 15 columns. This can be changed via init params and additionally you may use a batch_headers parameter to provide your own batches of headers.
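As a sketch of overriding the default batching (the file name and column names here are illustrative, and the minimal config is not a tuned recommendation):

```python
import pandas as pd
from gretel_synthetics.batch import DataFrameBatch

source_df = pd.read_csv("my_dataset.csv")  # hypothetical source data
config_template = {"field_delimiter": ",", "checkpoint_dir": "checkpoints"}

# Override the default of 15 columns per batch:
batcher = DataFrameBatch(df=source_df, config=config_template, batch_size=10)

# Or supply your own header groupings; each inner list becomes one batch
# (column names are illustrative):
custom_batches = [
    ["height", "weight", "age"],
    ["city", "state", "zip"],
]
batcher = DataFrameBatch(
    df=source_df, config=config_template, batch_headers=custom_batches
)
```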
There are a few notable changes and updates:
- input_data_path is not required (and will be ignored if provided) since this module will create the training data for you based on the input DataFrame.
- gen_lines is optional; if omitted, we automatically set this value to the number of rows in the input DataFrame. This means that the synthesized data should be the same shape as the input DataFrame!
- checkpoint_dir will automatically have “batch_N” sub-directories created under it. Each directory will contain the training data and models specific to that batch’s data.
- field_delimiter is required in Batch mode; this parameter controls how generated lines are split back into columns. If no validators are provided, the batch module at a minimum automatically ensures that generated lines have as many columns as the input training data.
- Note that regardless of how headers are split and batched, the final synthetic DataFrame will have the headers in the same order as the input DataFrame.
A vital step in data synthesis is validating that generated lines meet specific constraints. Gretel Synthetics allows validators to be run on each generated line; this functionality extends into the Batch module, where validators can be set for each batch.
By default, if no validators are provided, each line generated for a batch will be checked to verify that it can be split back into the number of headers that exist for that batch. This ensures that the output DataFrame has the same shape as the input DataFrame.
If you wish to configure a validator for a specific batch — for example, to balance a health dataset by only generating female gender records — you can do that like this:
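A sketch of what such a validator could look like. The assumptions here are all illustrative: the gender column is the third field of its batch, the field delimiter is a comma, and the batch index is 3:

```python
def validate_female_only(line: str):
    # A validator receives one generated line as a raw string.
    # Raising an exception rejects the line so it will not be kept.
    fields = line.split(",")
    if fields[2].strip().lower() != "female":
        raise ValueError("rejected: not a female gender record")

# Attach the validator to the batch that holds the gender column,
# where `batcher` is your DataFrameBatch instance (batch index 3 is
# purely illustrative):
#   batcher.set_batch_validator(3, validate_female_only)
```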
Stay tuned for our Public Beta which makes several more libraries available, including our custom validator package.
Our custom validation package will automatically learn constraints in your data and enforce them during generation. This includes character sets, numerical boundaries, string patterns, and more! As always, we will have simple API interfaces and notebook examples to help you get started.
Once we have created our DataFrameBatch instance, we step through training and generation. This workflow is bundled into three core methods: generating training data, training the models, and generating synthetic data:
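Putting it all together, the end-to-end flow looks roughly like this (the file name and minimal config values are illustrative placeholders):

```python
import pandas as pd
from gretel_synthetics.batch import DataFrameBatch

source_df = pd.read_csv("my_dataset.csv")  # hypothetical source data
config_template = {"field_delimiter": ",", "checkpoint_dir": "checkpoints"}

batcher = DataFrameBatch(df=source_df, config=config_template)

batcher.create_training_data()      # 1. write per-batch training files
batcher.train_all_batches()         # 2. train one model per batch
batcher.generate_all_batch_lines()  # 3. generate synthetic lines

# Reassemble the per-batch output into a single synthetic DataFrame
# with the same shape and header order as the input:
synthetic_df = batcher.batches_to_df()
```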
Now you’ve got your synthetic data!
We hope this overview has been helpful. Please step through our sample notebook which walks through the entire process using a publicly available dataset.