Introducing Gretel Amplify
Generate large volumes of tabular synthetic data at high speed.
We’re excited to announce the release of the Gretel Amplify model. Amplify is designed to rapidly generate large volumes of tabular synthetic data using statistical models and a hyper-efficient multi-processing implementation. Amplify runs on CPU and can generate data up to 1000x faster than deep learning-based generative models, enabling you to create large numbers of synthetic records very quickly. While Amplify is effective at learning and recreating distributions and correlations, it typically has a 10-15% drop in synthetic data accuracy versus Gretel’s deep learning models.
Some use cases for Amplify include:
- Creating large amounts of synthetic data to load test an application;
- Mimicking real-world data for pre-production environments;
- Generating synthetic examples to test an ML model’s ability to generalize to new data.
How fast is Amplify? Let’s find out. In this post, we’ll demonstrate Amplify’s speed by generating 5000 MB of data on a cloud worker with a quad-core CPU. You can follow along with our Gretel Amplify demo notebook.
Load and preview data
After inputting our Gretel API key, we start by specifying the input data we want to amplify. These can be local files, web locations, or Pandas DataFrames. In this example, our input data is a United States Census dataset on adult income from Kaggle. Here, we are amplifying real-world data, but Amplify can be used to generate large quantities of data from synthetic data as well.
The notebook shows a preview of the input data, along with basic information about the dataset.
Set target output size
There are two ways to indicate the amount of data you want Amplify to generate. You can use the `num_records` config parameter to tell the model to produce a certain number of records. Or, you can use the `target_size_mb` parameter to designate the desired size in megabytes of your output data. The maximum value for `target_size_mb` is 5000 (5 GB). Full details about the Amplify configuration file and parameters can be found here. For our demo, we want to generate 5000 MB of data, so we’ll set `target_size_mb = 5000`.
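As a rough illustration of the two sizing options, the sketch below builds a small parameter dictionary from either `num_records` or `target_size_mb`. Only the two parameter names and the 5000 MB cap come from the text above. The helper name `amplify_params`, the constant `MAX_TARGET_SIZE_MB`, and the rule that exactly one option is set are assumptions made for this example, not part of the Gretel SDK.

```python
from typing import Optional

# Cap stated in the post: target_size_mb may be at most 5000 (5 GB).
MAX_TARGET_SIZE_MB = 5000


def amplify_params(num_records: Optional[int] = None,
                   target_size_mb: Optional[int] = None) -> dict:
    """Hypothetical helper: return the sizing parameters for an Amplify run.

    Assumption of this sketch: exactly one sizing option is provided.
    """
    if (num_records is None) == (target_size_mb is None):
        raise ValueError("Set exactly one of num_records or target_size_mb")
    if num_records is not None:
        if num_records <= 0:
            raise ValueError("num_records must be positive")
        return {"num_records": num_records}
    if not 0 < target_size_mb <= MAX_TARGET_SIZE_MB:
        raise ValueError(
            f"target_size_mb must be greater than 0 and at most {MAX_TARGET_SIZE_MB}"
        )
    return {"target_size_mb": target_size_mb}


# The demo's setting: ask for 5000 MB of output.
params = amplify_params(target_size_mb=5000)
print(params)
```

Validating the value before submitting a job is a cheap way to catch an oversized request locally rather than after the model has been queued.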
Create and run the Amplify model
Creating and submitting our model takes just two lines of code.
Now, our model is off and running! We can use the helper function `poll()` to track the model’s progress.
About 30 minutes later, Amplify has finished generating! Now it’s time to unzip and view our results.
Let’s get to the good part! How fast is fast? You can use the wrapper function below to extract and print stats from the model.
In this example, Gretel Amplify generated 5010 MB of data in under 18 minutes, a generation rate of 4.7 MB/s. That’s 46,200,000 records at over 43,300 records per second. At that rate, a single cloud worker could generate 3.75 billion records a day! The total time to deliver the output data, including generation, artifact creation, and artifact upload, was just 33 minutes.
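The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the rounded numbers quoted above; the variable names are chosen for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures reported in the post (rounded).
generated_mb = 5010
mb_per_second = 4.7
records = 46_200_000
records_per_second = 43_300

# Generation time implied by each pair of figures, in minutes.
minutes_from_mb = generated_mb / mb_per_second / 60
minutes_from_records = records / records_per_second / 60

# Daily output at the rounded rate. The post says "over 43,300" records per
# second, so the true daily figure is somewhat higher than this value.
records_per_day = records_per_second * SECONDS_PER_DAY

print(f"{minutes_from_mb:.1f} minutes implied by the MB figures")
print(f"{minutes_from_records:.1f} minutes implied by the record figures")
print(f"{records_per_day:,} records per day at the rounded rate")
```

Both estimates land just under 18 minutes, and the rounded rate works out to roughly 3.74 billion records a day, which is consistent with the quoted 3.75 billion given that the measured rate was slightly above 43,300 records per second.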
Because Amplify employs multi-processing, its speed is roughly proportional to the number of CPUs used. In this demo, we ran on a single Gretel cloud worker using a quad-core CPU. Want to run even faster? Try running Amplify locally using more cores and see how fast you can generate! If you need more speed, you can scale horizontally by running Amplify on multiple workers, and/or get in touch with us at firstname.lastname@example.org to get a sneak peek of our upcoming turbocharged workers with more CPUs.
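To get a feel for what "roughly proportional" could mean in practice, here is a back-of-the-envelope estimator. It assumes perfectly linear scaling across cores and workers, which is an idealization: real throughput will depend on the hardware, the dataset, and upload overhead. The function names are hypothetical, and the baseline is the quad-core result reported above.

```python
BASELINE_CORES = 4                  # quad-core cloud worker used in the demo
BASELINE_RECORDS_PER_SEC = 43_300   # rounded rate reported for that worker


def estimated_records_per_sec(cores: int, workers: int = 1) -> float:
    """Hypothetical estimate assuming perfectly linear scaling."""
    if cores < 1 or workers < 1:
        raise ValueError("cores and workers must be at least 1")
    return BASELINE_RECORDS_PER_SEC * (cores / BASELINE_CORES) * workers


def estimated_minutes(num_records: int, cores: int, workers: int = 1) -> float:
    """Minutes of pure generation time under the same linear assumption."""
    return num_records / estimated_records_per_sec(cores, workers) / 60


# Compare the demo setup with a 16-core machine and with four 4-core workers.
for cores, workers in [(4, 1), (16, 1), (4, 4)]:
    rate = estimated_records_per_sec(cores, workers)
    minutes = estimated_minutes(46_200_000, cores, workers)
    print(f"{cores} cores x {workers} worker(s): "
          f"{rate:,.0f} records/s, about {minutes:.1f} min")
```

Treat the output as an upper bound for planning purposes rather than a prediction; measuring on your own hardware is the only reliable way to know the actual rate.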
The Gretel Amplify model is a powerful tool for generating large quantities of data in a flash. The Amplify model is available via our Console, CLI, and SDK and can be run in the Gretel Cloud or locally. For more information about Gretel Amplify, check out our docs.