Create high quality synthetic data in your cloud with Gretel.ai and Python
Create differentially private, synthetic versions of datasets and meet compliance requirements to keep sensitive data within your approved environment.
Whether your concern is HIPAA for Healthcare, PCI for the financial industry, or GDPR or CCPA for protecting consumer data, being able to get started building without needing a data processing agreement (DPA) in place to work with SaaS services can significantly reduce the time it takes to start your project and start creating value. Today we will walk through an example using Gretel.ai in a local (your cloud, or on-premises) configuration to generate high quality synthetic models and datasets.
Set up your local environment
To get started you need just three things.
- Dataset to synthesize in CSV or Pandas Dataframe format
- Gretel.ai API key (it’s free)
- Local computer / VM / cloud instance
Recommended setup. We recommend the following hardware configuration: CPU: 8+ vCPU cores recommended for synthetic record generation. GPU: Nvidia Tesla P4 with CUDA 10.x support recommended for training. RAM: 8GB+. Operating system: Ubuntu 18.04 for GPU support, or Mac OS X (no GPU support with Macs).
See TensorFlow’s excellent setup guide for GPU acceleration. While a GPU is not required, it is generally at least 10x faster training on GPU than CPU. Or run on CPU and grab a ☕.
Generate an API key
With an API key, you get free access to the Gretel public beta’s premium features which augment our open source library for synthetic data generation with improved field-to-field correlations, automated synthetic data record validation, and reporting for synthetic data quality.
Setup your system and install dependencies
We recommend setting up a virtual Python environment for your runtime to keep your system tidy and clean, in this example we will use the Anaconda package manager as it has great support for Tensorflow, GPU acceleration, and thousands of data science packages. You can download and install Anaconda here https://www.anaconda.com/products/individual.
Create the virtual environment
Install required Python packages
Install dependencies such as gretel-synthetics, Tensorflow, Pandas, and Gretel helpers (API key required) into your new virtual environment. Add the code samples below directly into your notebook, or download the complete synthetics notebook from Github.
Train the model and generate synthetic data
Load the source from CSV into a Pandas Dataframe, add or drop any columns, configure training parameters, and train the model. We recommend at least 5,000 rows of training data when possible.
Compare the source and synthetic datasets
Use Gretel.ai’s reporting functionality to verify that the synthetic dataset contains the same correlations and insights as the original source data.
Download your new synthetic dataset, and explore correlations and insights in the synthetic data report!
Want to run through end to end?
Download our walkthrough notebook on Github, load the notebook in your local notebook server, connect your API key, and start creating synthetic data!
At Gretel.ai we are super excited about the possibility of using synthetic data to augment training sets to create ML and AI models that generalize better against unknown data and with reduced algorithmic biases. We’d love to hear about your use cases- feel free to reach out to us for a more in-depth discussion in the comments, twitter, or [email protected]. Like gretel-synthetics? Give us a ⭐ on Github!