Synthetic Data Configuration Templates

Our new configuration templates will help you pick some of the right parameters needed to train your synthetic data models.

Have you ever gone to a restaurant or a bar and been given a blank menu? That would be awkward, right?

Menus are great because they give you options, even if you aren’t sure what you are looking for.

Well, we’ve got some options for you!

If you have seen our CHANGELOG blog post, then you know that we’re actively working on an API-driven approach to privacy engineering, backed by Gretel Cloud.

Part of this effort revolves around a configuration-driven approach. Instead of having to build data structures in a specific programming language, like Python, users will be able to construct their privacy workloads in configuration files that can be sent to our REST API. From there, a Gretel Worker can consume the job request from the API and do the heavy lifting, processing data in Gretel Cloud or in your own environment.

Configuration Templates

Over the last several weeks, we’ve spent a lot of time with customers helping create optimal synthetic configuration parameters based on the type of training data being used. Training data varies widely: it can have a small or large number of records, be all numeric, contain a lot of free text, and so on. Choosing the right configuration options is important for getting the best possible synthetic data.

We’ve found some synthetic data configurations that work well, and while we build toward Beta2 we decided to start tracking these configurations and make them available as templates that can be consumed and edited just before training a model.

Once Beta2 is released, you will be able to author configurations in YAML or JSON and push them to our REST API. The configuration templates for Gretel Synthetics can be found on GitHub.
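As a rough illustration of why the two formats are interchangeable, the sketch below writes a small set of parameters as JSON and reads it back using only Python’s standard library. The parameter names are borrowed from the blueprint example later in this post; the layout that Beta2 will actually expect is not covered here, so treat the structure as an assumption.

```python
import json

# Illustrative parameters only; the keys mirror the blueprint example in this
# post, and the final Beta2 schema may differ.
params = {"vocab_size": 20000, "overwrite": True}

# Serialize to JSON text, as you might before saving it to a file
config_text = json.dumps(params, indent=2)
print(config_text)

# Reading the text back yields the same dictionary
loaded = json.loads(config_text)
```

A YAML version would carry the same keys and values, just in a different syntax.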

Take a look at the README to explore the various configuration templates. Additionally, each template has some comments with more detail on the rationale behind the various parameter settings for each use case.

Backwards compatibility with current Gretel Beta

If you have been using our current Blueprints, you can use these configuration parameters today. You may recognize some of the params from the template YAML files as the keys in the synthetic config_template from the current blueprints.

Here’s what you are used to seeing in blueprints today:

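from pathlib import Path

# Store model checkpoints in a "checkpoints" folder under the working directory
checkpoint_dir = str(Path.cwd() / "checkpoints")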

config_template = {
    "checkpoint_dir": checkpoint_dir,
    "vocab_size": 20000,
    "overwrite": True
}

You may simply update the keys in the config_template dictionary with the key / value pairs defined in the param object from the new template files.
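A minimal sketch of that manual update, assuming the template values have already been copied into a plain Python dictionary. The `template_params` name and the values in it are placeholders for illustration, not the contents of any real template:

```python
from pathlib import Path

# Existing blueprint-style configuration
config_template = {
    "checkpoint_dir": str(Path.cwd() / "checkpoints"),
    "vocab_size": 20000,
    "overwrite": True,
}

# Hypothetical stand-in for the values copied from a template file;
# the real keys and values come from the template you choose.
template_params = {
    "vocab_size": 15000,
    "epochs": 50,
}

# Keys from the template override existing ones; all other keys are kept
config_template.update(template_params)
```

Because `dict.update` only touches the keys it is given, settings such as `checkpoint_dir` stay as they were.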

To save you that step, we’ve added a helper to our gretel-client that does this automatically. The gretel-client is already installed as part of our blueprints, so as long as you have the latest version you can start using this today. Our Synthetic Data blueprint has already been updated to use the new helper. Just provide the name of the configuration template (without the file extension) and the configuration dictionary will be loaded for you:

from gretel_client import get_synthetics_config

config_template = get_synthetics_config("low-record-count")

# modify config_template dict if need be

model = SyntheticDataBundle(
    synthetic_config=config_template,
    # ...
)
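When trying several variations of one template, it can help to leave the loaded dictionary untouched and derive copies from it. The following is a sketch under the assumption that the helper returns a plain dictionary; `with_overrides` is a hypothetical function, not part of gretel-client, and a stand-in dictionary replaces the real `get_synthetics_config` call so the snippet runs on its own.

```python
import copy


def with_overrides(base_config, **overrides):
    """Return a copy of base_config with overrides applied (hypothetical helper)."""
    config = copy.deepcopy(base_config)
    config.update(overrides)
    return config


# Stand-in for get_synthetics_config("low-record-count"); values are illustrative
base = {"vocab_size": 20000, "overwrite": True}

# The variant gets the new value while the base dictionary is left unchanged
variant = with_overrides(base, vocab_size=10000)
```

Each variant can then be passed as `synthetic_config` in place of `config_template`.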

Summary

Aren’t options great? No matter what your data looks like, we hope some of our configuration templates will help you pick some of the right parameters needed to train your synthetic data models. You can always try experimenting with some of the parameters once you’ve loaded a template.

As we make progress with Beta2, we’ll continue to post updates!

We’d love to hear how these templates work. As usual, send us an email at hi@gretel.ai or come join us in Slack.