
Create high quality synthetic data in your cloud with Gretel.ai and Python

Create differentially private, synthetic versions of datasets and meet compliance requirements to keep sensitive data within your approved environment.

Whether your concern is HIPAA for Healthcare, PCI for the financial industry, or GDPR or CCPA for protecting consumer data, being able to get started building without needing a data processing agreement (DPA) in place to work with SaaS services can significantly reduce the time it takes to start your project and start creating value. Today we will walk through an example using Gretel.ai in a local (your cloud, or on-premises) configuration to generate high quality synthetic models and datasets.

Set up your local environment

To get started, you need just three things:

  1. A dataset to synthesize, in CSV or Pandas DataFrame format
  2. A Gretel.ai API key (it’s free)
  3. A local computer, VM, or cloud instance

Recommended setup. We recommend the following hardware configuration:

  * CPU: 8+ vCPU cores recommended for synthetic record generation
  * GPU: Nvidia Tesla P4 with CUDA 10.x support recommended for training
  * RAM: 8 GB+
  * Operating system: Ubuntu 18.04 for GPU support, or macOS (no GPU support on Macs)

See TensorFlow’s excellent setup guide for GPU acceleration. A GPU is not required, but training is generally at least 10x faster on a GPU than on a CPU. Or run on CPU and grab a ☕.
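Before kicking off a long training run, it can help to confirm that TensorFlow actually sees your GPU. A minimal sketch (assuming TensorFlow 2.1+, where `tf.config.list_physical_devices` is available; it falls back gracefully when TensorFlow is not installed):

```python
def gpu_available() -> bool:
    """Return True if TensorFlow can see a CUDA GPU, False otherwise
    (including when TensorFlow itself is not installed)."""
    try:
        import tensorflow as tf
    except ImportError:
        return False
    return len(tf.config.list_physical_devices("GPU")) > 0


print("GPU available:", gpu_available())
```

If this prints `False` on a machine with an Nvidia GPU, revisit the CUDA setup steps in TensorFlow’s guide.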

Generate an API key

With an API key, you get free access to the Gretel public beta’s premium features, which augment our open source library for synthetic data generation with improved field-to-field correlations, automated synthetic data record validation, and reporting on synthetic data quality.

Log in to Gretel.ai or create a free account with a GitHub or Google email. Click your profile icon at the top right, then API Key. Generate a new API token and copy it to the clipboard.

Generate an API key at https://console.gretel.cloud

Setup your system and install dependencies

We recommend setting up a virtual Python environment for your runtime to keep your system tidy. In this example we use the Anaconda package manager, as it has great support for TensorFlow, GPU acceleration, and thousands of data science packages. You can download and install Anaconda from https://www.anaconda.com/products/individual.

Create the virtual environment

conda create --name synthetics python=3.8 # create the virtual environment
conda activate synthetics # activate your virtual environment
conda install jupyter # set up notebook environment
jupyter notebook # launch notebook in browser

Install required Python packages

Install dependencies such as gretel-synthetics, TensorFlow, Pandas, and the Gretel helpers (API key required) into your new virtual environment. Add the code samples below directly into your notebook, or download the complete synthetics notebook from GitHub.
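As a sketch, installing the open source pieces from PyPI inside the activated environment looks like the following (package names as published on PyPI; the premium Gretel helpers are installed separately using your API key, following the instructions in the console or the walkthrough notebook):

```shell
# Run inside the activated "synthetics" conda environment
pip install -U gretel-synthetics pandas
```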

Train the model and generate synthetic data

Load the source CSV into a Pandas DataFrame, add or drop any columns, configure training parameters, and train the model. We recommend at least 5,000 rows of training data when possible.
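The loading and preparation step can be sketched with plain Pandas (the inline CSV and the `customer_id` column below are purely illustrative; in practice you would read your own file and drop any direct identifiers before training):

```python
import io

import pandas as pd

# A tiny inline stand-in for your real source CSV (illustrative only)
csv_data = io.StringIO(
    "customer_id,age,plan\n"
    "c1,34,basic\n"
    "c2,45,pro\n"
    "c3,29,basic\n"
)
df = pd.read_csv(csv_data)

# Drop direct identifiers that should not be synthesized
df = df.drop(columns=["customer_id"])

print(f"{len(df)} training rows, columns: {list(df.columns)}")
```

The resulting DataFrame is then passed to the Gretel training step; with real data, aim for at least 5,000 rows as noted above.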

Compare the source and synthetic datasets

Use Gretel.ai’s reporting functionality to verify that the synthetic dataset contains the same correlations and insights as the original source data.

# Preview the synthetic DataFrame
bundle.synthetic_df()
# Generate a synthetic data report
bundle.generate_report()
# Save the synthetic dataset to CSV
bundle.synthetic_df().to_csv('synthetic-data.csv', index=False)

Download your new synthetic dataset, and explore correlations and insights in the synthetic data report!

Comparing insights between the source and synthetic datasets

Want to run through end to end?

Download our walkthrough notebook on GitHub, load it in your local notebook server, connect your API key, and start creating synthetic data!

Conclusion

At Gretel.ai we are super excited about the possibility of using synthetic data to augment training sets, creating ML and AI models that generalize better to unseen data and have reduced algorithmic bias. We’d love to hear about your use cases; feel free to reach out for a more in-depth discussion in the comments, on Twitter, or at [email protected]. Like gretel-synthetics? Give us a ⭐ on GitHub!

This article was originally shared by
Towards Data Science
