Generate synthetic data in 3 lines of code

Learn the simplest way to generate synthetic data without setting up your own infrastructure and GPUs.

Updated

August 24, 2022

We are super excited to introduce gretel-trainer, an interface designed to be the simplest way to generate synthetic data and a preview of new features coming to Gretel’s production SDK and API. Jason Warner, formerly CTO at GitHub and VP of Engineering at Heroku, shared some advice that has always stuck with me: “If it’s 10 lines of code, make it 5. If it’s 5 lines of code, make it 1.”

In today’s post, we’ll walk through a code example for generating synthetic data using gretel-trainer and how you can use it to create synthetic data for anything from the CSV you have on your computer to datasets with millions of rows of data and thousands of columns. Try out the code below, or follow along step-by-step with our notebook in Colab.

First, install the gretel-trainer library.

pip install -Uqq gretel-trainer

Below is the simplest path to creating synthetic data. This code uses Gretel’s APIs to train a deep learning generative model on the popular US adult income dataset and then generate a synthetic version of it.

from gretel_trainer import trainer

dataset = "https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv"

# Simplest example
model = trainer.Trainer()
model.train(dataset)
print(model.generate())

Want to load an existing trained model, and use it to generate more data? That’s easy too!

# Or, load and generate data from an existing model

model = trainer.Trainer.load()
model.generate(num_records=70)

Need to customize your synthetic model? Selecting an underlying model and tuning model parameters is easier than ever.

# Specify underlying model and config options.
# config can be a string, dict, or path

from gretel_trainer.models import GretelLSTM

model_type = GretelLSTM(config="synthetics/default")

# Optionally update model params from a base config
model_type.update_params({'epochs': 50})

model = trainer.Trainer(model_type=model_type)
model.train(dataset)
model.generate()
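Once you have generated records, a quick sanity check is to compare how often each value appears in a column of the original data versus the synthetic data. The sketch below is one hedged way to do that with the standard library only. It assumes both datasets are available as lists of dicts (for instance via pandas' `DataFrame.to_dict("records")`, if `model.generate()` returns a DataFrame, which this post does not state). The helper names `value_shares` and `largest_share_gap` are hypothetical and not part of gretel-trainer, and the rows are made up for illustration.

```python
from collections import Counter


def value_shares(records, column):
    """Map each value of `column` to its share of the records (0.0 to 1.0)."""
    counts = Counter(record[column] for record in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def largest_share_gap(real, synthetic, column):
    """Largest absolute difference in value share between the two datasets."""
    real_shares = value_shares(real, column)
    synth_shares = value_shares(synthetic, column)
    values = set(real_shares) | set(synth_shares)
    return max(
        (abs(real_shares.get(v, 0.0) - synth_shares.get(v, 0.0)) for v in values),
        default=0.0,  # both datasets empty: nothing to compare
    )


# Made-up rows for illustration only; not taken from the adult income dataset.
real = [{"income": "<=50K"}, {"income": "<=50K"}, {"income": "<=50K"}, {"income": ">50K"}]
synthetic = [{"income": "<=50K"}, {"income": "<=50K"}, {"income": ">50K"}, {"income": ">50K"}]

gap = largest_share_gap(real, synthetic, "income")
print(gap)  # 0.25
```

A small gap on a few key columns is only a rough signal, but it is a cheap first check before looking at a fuller quality report.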

Since gretel-trainer uses Gretel’s fully managed cloud service for model training and generation, you can create state-of-the-art synthetic data without needing to set up or manage infrastructure and GPUs. Try running our Colab notebook, and as a next step, try it on one of your own datasets or CSVs. Have questions? Ask for help on Gretel.ai’s community Slack.
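Before pointing gretel-trainer at your own CSV, it can help to check the file's shape first. The following is a minimal sketch using only Python's standard `csv` module. The function `summarize_csv` and the filename in the comment are hypothetical, and nothing here calls Gretel's APIs. It reports the row count, the column names, and which columns contain blank values.

```python
import csv
import io


def summarize_csv(handle):
    """Return (row_count, column_names, columns_with_blank_values)."""
    reader = csv.DictReader(handle)
    columns = list(reader.fieldnames or [])
    blanks = set()
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            value = row.get(name)
            # Short rows yield None for missing fields; treat those as blank too.
            if value is None or value.strip() == "":
                blanks.add(name)
    return rows, columns, sorted(blanks)


# Tiny in-memory sample standing in for a local file. With a real file,
# pass open("my_data.csv", newline="") instead (hypothetical filename).
sample = io.StringIO("age,workclass,income\n39,State-gov,<=50K\n50,,>50K\n")
rows, columns, blanks = summarize_csv(sample)
print(rows, columns, blanks)
```

Columns that turn out to be mostly blank, or a row count far from what you expected, are worth sorting out before training a model on the file.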

What’s next: In part 2 of this post, we will dive into more advanced use cases, using gretel-trainer with conditional data generation to modify distributions in the underlying data. This is a valuable technique for improving ML-based fraud detection in cyber and financial datasets.