Introducing Gretel Amplify

Generate large volumes of tabular synthetic data at high speed.

Introduction

We’re excited to announce the release of the Gretel Amplify model. Amplify is designed to rapidly generate large volumes of tabular synthetic data using statistical models and an efficient multi-processing implementation. It runs on CPU and can generate data up to 1000x faster than deep learning-based generative models. The trade-off is accuracy: while Amplify is effective at learning and recreating distributions and correlations, it typically shows a 10-15% drop in synthetic data accuracy versus Gretel’s deep learning models.

Some use cases for Amplify include:

  • Creating large amounts of synthetic data to load test an application;
  • Mimicking real-world data for pre-production environments;
  • Generating synthetic examples to test an ML model’s ability to generalize to new data.

How fast is Amplify? Let’s find out. In this post, we’ll generate 5000 MB (5 GB) of data on a cloud worker with a quad-core CPU to demonstrate Amplify’s speed. You can follow along with our Gretel Amplify demo notebook.

Load and preview data

After inputting our Gretel API key, we start by specifying the input data we want to amplify. These can be local files, web locations, or Pandas DataFrames. In this example, our input data is a United States Census dataset on adult income from Kaggle. Here, we are amplifying real-world data, but Amplify can be used to generate large quantities of data from synthetic data as well. 
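If you’re setting up outside the demo notebook, configuring a session looks roughly like this. This is a minimal sketch using the gretel-client SDK; the prompt-based API key flow shown below is an assumption based on Gretel’s standard notebook setup:

from gretel_client import configure_session

# Authenticate against the Gretel Cloud; "prompt" interactively asks for
# your API key and caches it for the rest of the session
configure_session(api_key="prompt", cache="yes", validate=True)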

import pandas as pd

# Load the input dataset from the Gretel blueprints repository
url = 'https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/us-adult-income.csv'
df = pd.read_csv(url)

# Print the dataset name (in bold), record count, and in-memory size
print('\033[1m' + "Input Data - US Adult Income" + '\033[0m')
print('Number of records: {}'.format(len(df)))
print('Size: {:.2f} MB'.format(df.memory_usage(index=True).sum()/1e6))
df

The notebook shows a preview of the input data, along with basic information about the dataset. 

Figure 1: Preview of input dataset - Adult Income from 1994 United States Census

Set target output size

There are two ways to tell Amplify how much data to generate. You can use the `num_records` config parameter to request a specific number of records, or the `target_size_mb` parameter to specify the desired output size in megabytes. The maximum value for `target_size_mb` is 5000 (5 GB). Full details about the Amplify configuration file and parameters can be found in our Amplify model creation docs. For our demo, we want to generate 5000 MB of data, so we’ll set `target_size_mb = 5000`.

from gretel_client.projects.models import read_model_config

# Pull the default Amplify model config from Gretel's blueprints repository
config = read_model_config("https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/config_templates/gretel/synthetics/amplify.yml")

# Set config parameters
config['models'][0]['amplify']['params']['target_size_mb'] = 5000   # 5 GB
config['name'] = "amplify-demo"
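If you’d rather fix the record count than the output size, you can set `num_records` instead. A quick sketch (here we drop `target_size_mb`, since you specify one or the other):

# Alternative: request a fixed number of records rather than a target size
config['models'][0]['amplify']['params'].pop('target_size_mb', None)
config['models'][0]['amplify']['params']['num_records'] = 1_000_000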

Create and run the Amplify model

Creating and submitting our model takes just a few lines of code.

from gretel_client.projects import create_or_get_unique_project
from gretel_client.helpers import poll

# Designate the project that will hold our model
project = create_or_get_unique_project(name="amplify")

# Create the model, submit it to the Gretel Cloud, and poll until it completes
model = project.create_model_obj(model_config=config, data_source=df)
model.submit_cloud()
poll(model)

Now, our model is off and running! We can use the helper function `poll()` to track the model’s progress.

Figure 3: Logs displayed by poll function during Amplify model run


View results

About 30 minutes later, Amplify has finished generating! Now it’s time to load the gzipped output and view our results.

# Load the gzipped data preview artifact directly into a DataFrame
amp = pd.read_csv(model.get_artifact_link("data_preview"), compression="gzip")
amp

Figure 4: Preview of amplified data with 46,200,000 records

Let’s get to the good part! How fast is fast? The demo notebook defines a small `stats()` wrapper that extracts and prints run statistics from the model.

stats(model)
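If you aren’t using the notebook, a hypothetical stand-in for that helper can derive the headline numbers from the amplified DataFrame and the generation time you observed. Both the function and its signature below are illustrative, not part of the Gretel SDK:

# Hypothetical stats helper: derive throughput from the output DataFrame
# and a manually observed generation time
def throughput_stats(df, elapsed_seconds):
    size_mb = df.memory_usage(index=True).sum() / 1e6
    print('Records generated: {:,}'.format(len(df)))
    print('Output size:       {:,.0f} MB'.format(size_mb))
    print('Generation rate:   {:.1f} MB/s'.format(size_mb / elapsed_seconds))
    print('Record rate:       {:,.0f} records/s'.format(len(df) / elapsed_seconds))

throughput_stats(amp, elapsed_seconds=18 * 60)  # ~18 minutes in this run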

In this example, Gretel Amplify generated 5010 MB of data in under 18 minutes, a generation rate of 4.7 MB/s. That’s 46,200,000 records at a speed of over 43,300 records per second. At that rate, you could generate 3.75 billion records a day using a single cloud worker! The total time to deliver the output data, which includes generation, artifact creation, and artifact upload, was just 33 minutes.

Figure 5: Amplify model result stats
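As a quick sanity check, the daily figure follows directly from the measured record rate:

# 43,300 records/s sustained over a full day (86,400 seconds)
print(43_300 * 86_400 / 1e9)   # ≈ 3.74 billion records per day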

Because Amplify employs multi-processing, its speed scales roughly linearly with the number of CPU cores available. In this demo, we ran on a single Gretel Cloud worker with a quad-core CPU. Want to go even faster? Try running Amplify locally with more cores and see how quickly you can generate! If you need more speed, you can scale horizontally by running Amplify on multiple workers, or get in touch with us at hi@gretel.ai for a sneak peek of our upcoming turbocharged workers with more CPUs.

Conclusion

The Gretel Amplify model is a powerful tool for generating large quantities of data in a flash. The Amplify model is available via our Console, CLI, and SDK and can be run in the Gretel Cloud or locally. For more information about Gretel Amplify, check out our docs. 

How will you use Amplify? Let us know! Feel free to reach out at hi@gretel.ai or join our Slack Community and share your thoughts.