Scale Synthetic Data to Millions of Rows with ACTGAN
Whether you’re building better machine learning training data or creating safe pre-production environments for experimentation and testing, synthetic data has become an essential tool for businesses that want to leverage the power of data without risking sensitive information. Companies and developers looking to operationalize synthetic data are faced with the challenge of scaling beyond small sample training data with tens of thousands of rows and tens of columns to datasets that have tens of millions of rows and hundreds of columns.
Unfortunately, when many developers start training synthetic data models at scale, they quickly hit roadblocks because their data exceeds the capabilities of the generative models available to them. At Gretel, we understand these requirements and have built a variety of synthetic data models to meet the demands of businesses that need to generate synthetic data at scale.
In this post, we'll focus on one of our models, Gretel ACTGAN (Anyway Conditional Tabular Generative Adversarial Network), which excels at generating synthetic data for high-dimensional tabular datasets such as those often used in the ads, cybersecurity, financial, and life sciences verticals.
About a year ago, we began working with an excellent open-source synthetic data model called CTGAN, which was originally presented at the NeurIPS 2019 conference in the paper Modeling Tabular data using Conditional GAN. While CTGAN works well on datasets with up to hundreds of columns, developers using our platform quickly hit limitations when scaling to datasets with millions of rows. Because supporting this scale required significant changes to the model architecture, and because we were concerned that the open-source license would change to something more restrictive, we opted to fork a copy into a new model called ACTGAN in the gretel-synthetics library. ACTGAN builds on the foundation of CTGAN and enhances it with new features that make synthetic data generation significantly more efficient and accurate for machine learning use cases.
It's worth noting that recent changes to the SDV CTGAN model licensing have made it difficult for enterprises to use the model in their products without purchasing a license, including restrictions on “productionizing the SDV within an enterprise” or “building and selling software that uses the SDV.” In contrast, Gretel ACTGAN is available under a source-available license whose only restriction is on using it to build competing synthetic data services.
ACTGAN’s improvements for memory and accuracy
ACTGAN significantly reduces CPU and GPU memory requirements, enabling faster training and larger datasets. We accomplished this reduction by adding a new Binary Encoder alongside one-hot encoding, and by using a more efficient internal representation of training data. Users can now synthesize much larger datasets without upgrading to higher-memory GPUs or enduring lengthy training times.
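To see why a binary encoder can save so much memory, here is a minimal sketch that compares the encoded width of a categorical column under the two schemes. It is a generic illustration of binary encoding, not ACTGAN's internal implementation; the function names are hypothetical, and real encoders may reserve extra codes (for example, for unseen values).

```python
def one_hot_width(cardinality: int) -> int:
    # One-hot encoding uses one output column per distinct category.
    return cardinality


def binary_width(cardinality: int) -> int:
    # Binary encoding writes the category index in base 2, so it needs
    # only enough bits to give every category a distinct code.
    return max(1, (cardinality - 1).bit_length())


def binary_encode(categories: list, value) -> list:
    # Encode one value as a fixed-width list of bits, most significant first.
    index = categories.index(value)
    width = binary_width(len(categories))
    return [(index >> bit) & 1 for bit in reversed(range(width))]


if __name__ == "__main__":
    for cardinality in (5, 1_000, 100_000):
        print(cardinality, one_hot_width(cardinality), binary_width(cardinality))
    print(binary_encode(["a", "b", "c", "d", "e"], "d"))
```

For a column with 1,000 distinct values, the one-hot representation is 1,000 columns wide while the binary representation needs only 10, and the gap keeps growing with cardinality.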
Another significant improvement made to ACTGAN is the automatic detection and transformation of date-times. This feature enables synthetic dates to be sampled from a distribution rather than being treated as categorical. This results in more realistic synthetic data that accurately reflects the underlying distribution of the original dataset.
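As a rough illustration of the idea, the sketch below converts datetime strings to a numeric scale (seconds since the Unix epoch) and back. It is not ACTGAN's actual detection or transformation logic; the helper names and the fixed format string are assumptions made for the example.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%d %H:%M:%S"  # assumed format for this example


def to_epoch_seconds(values: list, fmt: str = FMT) -> list:
    # Map each datetime string to a float so a model can treat the
    # column as continuous instead of as a set of unrelated categories.
    return [
        datetime.strptime(v, fmt).replace(tzinfo=timezone.utc).timestamp()
        for v in values
    ]


def from_epoch_seconds(seconds: list, fmt: str = FMT) -> list:
    # Map numeric values (for example, ones a model sampled) back to strings.
    return [
        datetime.fromtimestamp(s, tz=timezone.utc).strftime(fmt)
        for s in seconds
    ]


if __name__ == "__main__":
    observed = ["2023-01-01 00:00:00", "2023-01-03 00:00:00"]
    numeric = to_epoch_seconds(observed)
    # A value between two observed timestamps is still a valid date,
    # even though it never appeared in the training data.
    midpoint = (numeric[0] + numeric[1]) / 2
    print(from_epoch_seconds([midpoint]))
```

Treated as categories, a model could only ever reproduce the exact timestamps it saw; on a numeric scale, it can produce new, plausible dates that fall within the observed range.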
Finally, ACTGAN also includes improvements to conditional vector sampling, which greatly enhances the accuracy of conditional data generation. This feature is particularly useful for generating new labeled examples for machine learning datasets.
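For readers unfamiliar with conditional vectors, the sketch below shows the general idea as described in the CTGAN paper: pick one discrete column, pick a category from it with probability weighted by log frequency, and mark that choice in a vector that spans all categories of all discrete columns. It does not reproduce ACTGAN's specific sampling improvements, which this post does not detail; the function names and the `log(1 + count)` weighting are assumptions for illustration.

```python
import math
import random
from collections import Counter


def log_frequency_weights(values: list):
    # Weight each category by log(1 + count) so that rare categories are
    # picked more often than their raw frequency would allow.
    counts = Counter(values)
    categories = sorted(counts)
    weights = [math.log(1 + counts[c]) for c in categories]
    return categories, weights


def sample_conditional_vector(columns: dict, rng: random.Random):
    # columns maps a discrete column name to its list of observed values.
    names = sorted(columns)
    chosen = rng.choice(names)  # pick one discrete column uniformly
    vector, condition = [], None
    for name in names:
        categories, weights = log_frequency_weights(columns[name])
        block = [0] * len(categories)
        if name == chosen:
            pick = rng.choices(range(len(categories)), weights=weights, k=1)[0]
            block[pick] = 1
            condition = (name, categories[pick])
        vector.extend(block)
    return vector, condition


if __name__ == "__main__":
    data = {
        "department": ["parks"] * 90 + ["water"] * 10,
        "status": ["open", "closed", "closed"],
    }
    print(sample_conditional_vector(data, random.Random(0)))
```

The returned vector has exactly one position set, identifying the (column, category) pair the generator is asked to produce. The log weighting softens class imbalance, which is what makes this style of sampling useful for generating extra examples of under-represented labels.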
An experiment to examine ACTGAN’s memory improvements
Let’s take a look at how these memory improvements translate into support for larger datasets and faster, more accurate training. We'll use a 311 Call Center dataset from Kansas City, MO’s open data project, which has a mix of categorical and numeric data. Feel free to follow along using this Colab notebook.
First, let’s compare wall-clock time and GPU memory consumption using a sample of 5,000 records from the dataset on a Colab instance with 4 cores, 16GB system RAM, and an NVIDIA V100 GPU. In addition to the 6.4x speedup below, ACTGAN requires only about 6% of the GPU memory footprint of CTGAN for this dataset (835 MB vs 13,769 MB).
- CTGAN: 1025 seconds to synthesize 5k records of call center data
- ACTGAN: 161 seconds to synthesize 5k records of call center data (6.4x speedup)
For the second experiment, we'll run at a larger scale, processing 100k records of data and increasing the batch size for our GPU. Since we are running 20x more data through the model, we can reduce the number of epochs to 25 for this experiment. In this example, ACTGAN scales to a dataset 20x larger than in the first experiment, with only a minimal increase in memory requirements and runtime.
- ACTGAN: 179 seconds to synthesize 100k records of call center data.
- CTGAN: system out-of-memory (OOM) error attempting to allocate 74.5 GiB of system memory.
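A back-of-the-envelope estimate shows how an allocation of that size can arise. The sketch below computes the memory needed for a dense matrix of encoded training data, assuming 8-byte floating-point values. The column cardinality used here is hypothetical, and the exact shape of the array CTGAN tried to allocate is an assumption, not something measured in this experiment.

```python
def dense_matrix_gib(rows: int, width: int, bytes_per_value: int = 8) -> float:
    # Memory for a dense rows x width array, in GiB (2**30 bytes).
    return rows * width * bytes_per_value / 2**30


def binary_width(cardinality: int) -> int:
    # Bits needed to give each category a distinct binary code.
    return max(1, (cardinality - 1).bit_length())


if __name__ == "__main__":
    rows = 100_000
    # Hypothetical: a single column whose values are nearly unique per row,
    # such as an ID or free-text field, one-hot encoded.
    cardinality = 100_000
    print(f"one-hot: {dense_matrix_gib(rows, cardinality):.1f} GiB")
    print(f"binary:  {dense_matrix_gib(rows, binary_width(cardinality)):.3f} GiB")
```

Under these assumptions, a single high-cardinality column that is one-hot encoded across 100k rows needs about 74.5 GiB, the same order as the failed allocation above, while a binary encoding of the same column fits in a few megabytes.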
ACTGAN builds on the foundation of CTGAN and introduces new features that make synthetic data generation scale with the needs of enterprises. Its more permissive licensing also makes it a more accessible choice for businesses that want to generate synthetic data in-house. Check out ACTGAN today by running the open source example, or get started in minutes using the fully featured Gretel synthetic data platform.