Generate synthetic data in 3 lines of code
We are super excited to introduce gretel-trainer, an interface designed to be the simplest way to generate synthetic data, and a preview of new features coming to Gretel’s production SDK & API. Jason Warner, formerly CTO at GitHub and VP of Engineering at Heroku, once shared some advice that has stuck with me: “If it’s 10 lines of code, make it 5. If it’s 5 lines of code, make it 1.”
In today’s post, we’ll walk through a code example for generating synthetic data with gretel-trainer, and show how you can use it on anything from a CSV on your computer to datasets with millions of rows and thousands of columns. Try out the code below, or follow along step-by-step with our notebook in Colab.
First, install the gretel-trainer library.
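From a shell or notebook cell, the install is typically `pip install -U gretel-trainer`. To do the same thing from Python, here is a minimal sketch. The helper name `ensure_installed` is ours, not part of gretel-trainer, and the import name `gretel_trainer` is an assumption about how the package is exposed.

```python
import importlib.util
import subprocess
import sys


def ensure_installed(module_name: str, pip_name: str) -> bool:
    """Install `pip_name` with pip if `module_name` cannot be imported.

    Returns True if an install was attempted, False if the module
    was already available in this environment.
    """
    if importlib.util.find_spec(module_name) is not None:
        return False
    # Use the current interpreter's pip so the package lands in this environment.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-U", pip_name])
    return True
```

Calling `ensure_installed("gretel_trainer", "gretel-trainer")` once at the top of a notebook should be enough. It skips the install when the package is already present.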
The simplest path to creating synthetic data takes just three lines: create a trainer, train it, and generate. The code uses Gretel’s APIs to train a deep learning generative model on the popular US adult income dataset and to create a synthetic equivalent.
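Here is a minimal sketch of that flow. It assumes the package exposes a `trainer.Trainer` class with `train()` and `generate()` methods, and that your Gretel credentials are already configured. The wrapper function `synthesize` and the `csv_path` argument are hypothetical names added for illustration, so point the function at your own copy of the dataset.

```python
def synthesize(csv_path: str):
    """Train a synthetic model on `csv_path` and return generated records."""
    # Imported lazily so this function can be defined before the package is installed.
    from gretel_trainer import trainer

    model = trainer.Trainer()   # assumed: default constructor with default settings
    model.train(csv_path)       # assumed: trains in Gretel's managed cloud service
    return model.generate()     # assumed: returns the synthetic records
```

The three lines inside the function are the whole workflow. For example, `synthetic = synthesize("adult_income.csv")` would train on a local file by that (placeholder) name.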
Want to load an existing trained model, and use it to generate more data? That’s easy too!
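Here is a sketch of reloading a model and generating more records. It assumes `Trainer.load()` restores the most recently trained model and that `generate()` accepts a `num_records` argument. The wrapper name `generate_more` is hypothetical.

```python
def generate_more(num_records: int):
    """Reload a previously trained model and generate `num_records` new rows."""
    # Imported lazily so this function can be defined before the package is installed.
    from gretel_trainer import trainer

    model = trainer.Trainer.load()  # assumed: reloads the saved model state
    return model.generate(num_records=num_records)
```

Because the model is already trained, this skips the training step and goes straight to generation.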
Since the gretel-trainer uses Gretel’s fully managed cloud service for model training and generation, you can create state-of-the-art synthetic data without needing to set up or manage infrastructure and GPUs. Try running our Colab notebook, and for the next steps, try running on one of your own datasets or CSVs. Have questions? Ask for help on Gretel.ai’s community Slack.
What’s next: In part 2 of this post, we will dive into more advanced use cases, using gretel-trainer and conditional data generation to modify distributions in the underlying data. This is a valuable technique for improving ML-based fraud detection in cyber and financial datasets.