
Create high quality synthetic data in your cloud with Gretel.ai and Python

Create differentially private, synthetic versions of datasets and meet compliance requirements to keep sensitive data within your approved environment.

Whether your concern is HIPAA for Healthcare, PCI for the financial industry, or GDPR or CCPA for protecting consumer data, being able to get started building without needing a data processing agreement (DPA) in place to work with SaaS services can significantly reduce the time it takes to start your project and start creating value. Today we will walk through an example using Gretel.ai in a local (your cloud, or on-premises) configuration to generate high quality synthetic models and datasets.

Set up your local environment

To get started, you need just three things:

  1. A dataset to synthesize, in CSV or Pandas DataFrame format
  2. A Gretel.ai API key (it’s free)
  3. A local computer, VM, or cloud instance

Recommended setup. We recommend the following hardware configuration:

  * CPU: 8+ vCPU cores recommended for synthetic record generation
  * GPU: Nvidia Tesla P4 with CUDA 10.x support recommended for training
  * RAM: 8 GB+
  * Operating system: Ubuntu 18.04 for GPU support, or macOS (no GPU support on Macs)

See TensorFlow’s excellent setup guide for GPU acceleration. A GPU is not required, but training is generally at least 10x faster on a GPU than on a CPU. Or run on CPU and grab a ☕.
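Before kicking off a long training run, it can help to confirm that TensorFlow actually sees your GPU. A minimal sketch (assuming TensorFlow 2.1+, where `tf.config.list_physical_devices` is available; it falls back gracefully when TensorFlow is not installed):

```python
def gpu_available() -> bool:
    """Return True if TensorFlow can see a CUDA GPU, False otherwise
    (including when TensorFlow itself is not installed)."""
    try:
        import tensorflow as tf
    except ImportError:
        return False
    return len(tf.config.list_physical_devices("GPU")) > 0


print("GPU available:", gpu_available())
```

If this prints `False` on a machine with an Nvidia GPU, revisit the CUDA setup steps in TensorFlow’s guide.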

Generate an API key

With an API key, you get free access to the Gretel public beta’s premium features, which augment our open source library for synthetic data generation with improved field-to-field correlations, automated synthetic data record validation, and reporting on synthetic data quality.

Log in to Gretel.ai or create a free account with a GitHub or Google email. Click your profile icon at the top right, then API Key. Generate a new API token and copy it to the clipboard.

Generate an API key at https://console.gretel.cloud

Setup your system and install dependencies

We recommend setting up a virtual Python environment for your runtime to keep your system tidy. In this example we use the Anaconda package manager, as it has great support for TensorFlow, GPU acceleration, and thousands of data science packages. You can download and install Anaconda from https://www.anaconda.com/products/individual.

Create the virtual environment

conda create --name synthetics python=3.8 # create the virtual environment
conda activate synthetics # activate your virtual environment
conda install jupyter # set up notebook environment
jupyter notebook # launch notebook in browser

Install required Python packages

Install dependencies such as gretel-synthetics, TensorFlow, Pandas, and the Gretel helpers (API key required) into your new virtual environment. Add the code samples below directly into your notebook, or download the complete synthetics notebook from GitHub.
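As a sketch, installing the open source pieces from PyPI inside the activated environment looks like the following (package names as published on PyPI; the premium Gretel helpers are installed separately using your API key, following the instructions in the console or the walkthrough notebook):

```shell
# Run inside the activated "synthetics" conda environment
pip install -U gretel-synthetics pandas
```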

Train the model and generate synthetic data

Load the source CSV into a Pandas DataFrame, add or drop any columns, configure training parameters, and train the model. We recommend at least 5,000 rows of training data when possible.
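The loading and preparation step can be sketched with plain Pandas (the inline CSV and the `customer_id` column below are purely illustrative; in practice you would read your own file and drop any direct identifiers before training):

```python
import io

import pandas as pd

# A tiny inline stand-in for your real source CSV (illustrative only)
csv_data = io.StringIO(
    "customer_id,age,plan\n"
    "c1,34,basic\n"
    "c2,45,pro\n"
    "c3,29,basic\n"
)
df = pd.read_csv(csv_data)

# Drop direct identifiers that should not be synthesized
df = df.drop(columns=["customer_id"])

print(f"{len(df)} training rows, columns: {list(df.columns)}")
```

The resulting DataFrame is then passed to the Gretel training step; with real data, aim for at least 5,000 rows as noted above.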

Compare the source and synthetic datasets

Use Gretel.ai’s reporting functionality to verify that the synthetic dataset contains the same correlations and insights as the original source data.

# Preview the synthetic DataFrame
bundle.synthetic_df()
# Generate a synthetic data report
bundle.generate_report()
# Save the synthetic dataset to CSV
bundle.synthetic_df().to_csv('synthetic-data.csv', index=False)

Download your new synthetic dataset, and explore correlations and insights in the synthetic data report!

Comparing insights between the source and synthetic datasets

Want to run through end to end?

Download our walkthrough notebook on GitHub, load it in your local notebook server, connect your API key, and start creating synthetic data!

Conclusion

At Gretel.ai we are super excited about the possibility of using synthetic data to augment training sets, creating ML and AI models that generalize better to unseen data and have reduced algorithmic bias. We’d love to hear about your use cases; feel free to reach out for a more in-depth discussion in the comments, on Twitter, or at [email protected]. Like gretel-synthetics? Give us a ⭐ on GitHub!

This article was originally shared by
Towards Data Science
