Generate time-series data with Gretel’s new DGAN model

Announcing the open beta release of our DGAN model type.

Intro

Gretel is excited to release our first dedicated time-series model in our API. Time-series data, where correlations across time are crucial, are found everywhere, from smart home sensors to financial data. With our new DGAN model, you can use the DoppelGANger model to generate synthetic time-series with the convenience of Gretel’s API.

For details on the model, please see our previous blog post about our PyTorch implementation of DoppelGANger. You can now use that implementation directly through Gretel’s API, the same way you use Gretel-LSTM or Gretel-GPT: just modify the YAML configs as detailed below. Please note this is an open beta release, so there may be a few rough edges, and we’re eager to hear your feedback to improve the experience (join our Slack or Discord communities).

The rest of this blog walks through an example demonstrating basic usage of the DGAN model with our Python SDK, showing a few key highlights of the code and process. To follow along and see every line of code, use our DGAN blueprint Jupyter notebook.

Data

Let’s use daily crude oil prices for the running example in this blog. Suppose you’re an analyst at an energy company tasked with assessing the risk of different strategies for extracting, storing, and selling crude oil. You don’t know what’s going to happen next week or year, but you can look at historical trends. But there’s only one history to look at, so you want to generate synthetic data for oil prices to get a larger set of simulations to evaluate.

First, we need some existing data to train the synthetic model. The WTI and Brent spot price history from https://datahub.io/core/oil-prices will work nicely. These are reference prices for two types of crude oil: West Texas Intermediate and Europe Brent. See Figure 1 for a snippet of the data in tabular form. There are three columns: the date, and the daily prices for WTI and Brent from 1987 to 2022. The entire 35-year price history is shown in Figure 2.

Figure 1: Table of oil price data.
Figure 2: WTI and Brent prices from 1987 to 2022.

DGAN requires many examples of a time series to train (for best results we recommend at least 100 examples), but we have just one historical set of prices here. The most common way to address this is to split the long history into smaller subsequences, which our API supports directly via the max_sequence_len parameter. Here, we chose 10-day sequences, so the model trains on and generates two weeks of prices at a time; see Figure 3 for examples. With 35 years of daily price data, we have enough two-week examples for training (~900), and two weeks seems like a plausible starting point for a short-term risk analysis by our hypothetical analyst.

Figure 3: 10-day training examples from the oil price history.
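
To make the splitting concrete, here is a minimal sketch of cutting one long history into non-overlapping 10-row examples with pandas. It is an illustration only, not the code our API runs: the split_into_examples helper and the toy random-walk prices are hypothetical, and dropping the incomplete tail is an assumption of this sketch.

```python
import numpy as np
import pandas as pd


def split_into_examples(df, max_sequence_len):
    """Cut a long history into non-overlapping, fixed-length chunks."""
    # Assumption of this sketch: rows that do not fill a whole chunk are dropped.
    n_examples = len(df) // max_sequence_len
    trimmed = df.iloc[: n_examples * max_sequence_len].copy()
    # Rows 0-9 get id 0, rows 10-19 get id 1, and so on.
    trimmed["example_id"] = np.repeat(np.arange(n_examples), max_sequence_len)
    return trimmed


# Toy stand-in for the price history (random walks, not real oil prices).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", periods=105)
history = pd.DataFrame(
    {
        "Date": dates,
        "WTI": 60 + rng.normal(0, 1, len(dates)).cumsum(),
        "Brent": 65 + rng.normal(0, 1, len(dates)).cumsum(),
    }
)

examples = split_into_examples(history, max_sequence_len=10)
print(examples["example_id"].nunique(), "examples of", 10, "rows each")
```

With 105 rows and max_sequence_len set to 10, the sketch keeps 10 complete examples and discards the last 5 rows.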

Other data setups are possible, and the downstream task and goals for the synthetic data should inform the data setup. For example, if recent price trends are critical, then you might look for a data set with prices every minute (or faster) for the last 12 months and split by day. We’re happy to discuss these different data setups if you have any questions.

Model training

To train the model, we’ll use the config below. This is based on our DGAN config template with a few tweaks.

schema_version: 1.0
name: "dgan-oil-data"
models:
  - timeseries_dgan:
        data_source: "_"
        time_column: "Date"
        df_style: "long"
        params:
            max_sequence_len: 10
            sample_len: 1
            feature_noise_dim: 32
            feature_num_layers: 1
            feature_num_units: 100
            apply_feature_scaling: True
            apply_example_scaling: False
            generator_learning_rate: 1e-5
            discriminator_learning_rate: 1e-5
            attribute_discriminator_learning_rate: 1e-5
            batch_size: 100
            epochs: 5000            
            attribute_loss_coef: 10

To use different training data, the key parameters to change are time_column, sample_len, and max_sequence_len. For detailed explanations of these and other parameters, see the DGAN documentation.
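
When changing sample_len and max_sequence_len together, a quick sanity check can catch mismatches before submitting a job. The sketch below assumes that sample_len must divide max_sequence_len evenly, since DoppelGANger emits sample_len time points per generator step; please confirm that constraint in the DGAN documentation. The check_sequence_params helper is hypothetical and not part of our SDK.

```python
def check_sequence_params(params):
    """Return the number of generator steps implied by the two length parameters."""
    max_len = params["max_sequence_len"]
    sample_len = params["sample_len"]
    # Assumption: sample_len has to be a positive divisor of max_sequence_len.
    if sample_len < 1 or max_len % sample_len != 0:
        raise ValueError(
            f"sample_len={sample_len} does not evenly divide "
            f"max_sequence_len={max_len}"
        )
    return max_len // sample_len


# Values taken from the config above.
params = {"max_sequence_len": 10, "sample_len": 1}
steps = check_sequence_params(params)
print(f"{steps} generator steps per example")
```

For the oil config this gives 10 steps of one day each; a sample_len of 3 with the same max_sequence_len would raise a ValueError.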

The code snippet below uses that config to start model training in Gretel Cloud. Run the cell and go brew some coffee: training on this oil data takes about 10 minutes.

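from gretel_client.helpers import poll
from gretel_client.projects import get_project

# Create the project, or reuse it if it already exists.
project = get_project(display_name="DGAN-testing", create=True)

print(f"Follow model training at: {project.get_console_url()}")

# config holds the model configuration shown above.
model = project.create_model_obj(model_config=config)

# Write the training DataFrame to a CSV file and use it as the data source.
df_input.to_csv(TMP_FILE, index=False)
model.data_source = TMP_FILE

# Upload the data and start training in Gretel Cloud.
model.submit(upload_data_source=True)

# Wait for training to finish.
poll(model)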

Synthetic data and analysis

As part of training the model, we automatically generated some synthetic data. Let’s download that data into a Pandas DataFrame for analysis (see Figure 4 for excerpt from the table).

synthetic_df = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
synthetic_df = synthetic_df.drop(columns="id_column")
synthetic_df

Figure 4: Table of synthetic oil price data.

A few things to note in this synthetic data output: the date column just repeats the first 10 days from the original data, and there’s a fourth column called example_id. Both are a product of the model generating 10-day chunks at a time. The date column reuses an arbitrary date range from the training data, and the example_id values identify the different 10-day sequences in the generated data. Any downstream analysis should stay within those 10-day chunks.

Although the synthetic data is returned as a single CSV file (or read into a single DataFrame), we do not recommend treating it as one long history like the original input data, because there is a discontinuity every 10 days. If longer sequences of coherent data are required, try using a larger max_sequence_len in the config. The Data parameters documentation contains more info.
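
One way to respect those boundaries is to group by example_id before computing anything that spans rows. The sketch below uses a small toy DataFrame in place of the real synthetic output (the values are random, not model output) and assumes the column names example_id, WTI, and Brent.

```python
import numpy as np
import pandas as pd

# Toy stand-in for synthetic_df: 3 examples of 10 days each (not model output).
rng = np.random.default_rng(1)
n_examples, seq_len = 3, 10
toy_df = pd.DataFrame(
    {
        "example_id": np.repeat(np.arange(n_examples), seq_len),
        "WTI": 60 + rng.normal(0, 1, n_examples * seq_len),
        "Brent": 65 + rng.normal(0, 1, n_examples * seq_len),
    }
)

# Daily changes computed per example, so no difference spans two examples.
# The first row of every example is NaN because it has no previous day.
daily_change = toy_df.groupby("example_id")[["WTI", "Brent"]].diff()

# Net price move over each 10-day window: last price minus first price.
grouped = toy_df.groupby("example_id")[["WTI", "Brent"]]
window_change = grouped.last() - grouped.first()
print(window_change)
```

Calling diff() on the ungrouped DataFrame would instead subtract the last day of one example from the first day of the next, which is exactly the discontinuity to avoid.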

Figure 5 plots the synthetic oil prices for a few selected example_id values.

Figure 5: 10-day samples of oil prices from synthetic data.

These simulated price movements are a great start for a more robust risk assessment by our analyst. Of course, some iteration is usually needed as this DGAN model produces price histories that might have come anytime in the last 35 years and only for two weeks.

Besides visually checking if the synthetic data “look similar, but not identical” to the training data, there are many quantitative ways to evaluate and use the generated data. Just remember that the best temporal correlations are always within the max_sequence_len rows of a single example_id value. A few visualizations of the quality of the synthetic data are provided in the notebook. We’ll just show one evaluation of the temporal dynamics of the generated data compared to the training data. Figure 6 uses the autocorrelation of each variable with itself at different time lags.

Figure 6: Autocorrelation of oil prices from real and generated data.

We see the autocorrelations for real and generated data are similar, so the DGAN model is effectively learning how prices usually change over one, two, three, etc. days. This is one indicator that we have high-quality synthetic data that is appropriate for a day-to-day risk assessment task.
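
For readers who want to run a similar check on their own data, here is one possible way to compute an average autocorrelation at a given lag across many short sequences. It is a sketch under our own assumptions and not necessarily how the notebook produces Figure 6; the mean_autocorrelation helper and the toy sequences are hypothetical.

```python
import numpy as np


def mean_autocorrelation(sequences, lag):
    """Average lag-`lag` autocorrelation over the rows of a (n_examples, seq_len) array."""
    values = []
    for seq in sequences:
        head, tail = seq[:-lag], seq[lag:]
        if head.std() == 0 or tail.std() == 0:
            continue  # correlation is undefined for a constant slice
        values.append(np.corrcoef(head, tail)[0, 1])
    return float(np.mean(values))


rng = np.random.default_rng(2)
# Toy sequences: random walks stand in for 10-day price examples,
# and independent noise serves as a contrast with no temporal structure.
walks = rng.normal(0, 1, size=(200, 10)).cumsum(axis=1)
noise = rng.normal(0, 1, size=(200, 10))

for lag in (1, 2, 3):
    print(lag, mean_autocorrelation(walks, lag), mean_autocorrelation(noise, lag))
```

Each row is treated as one example, so the correlation is always computed within a single sequence. Applying the same function to real and synthetic examples gives two curves that can be compared lag by lag.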

The oil data and config above demonstrate the simplest usage of our DGAN model. More advanced setups are also supported, such as providing your own example_id column, using fixed attributes that do not vary over time, and using categorical variables. See our documentation for more info on how to take advantage of these features.

Also, please note this is an open beta model and thus has a few rough edges. One issue we’ve encountered is that training can be unstable, so if a run does not produce nice results like the plots shown above, sometimes simply retraining with the same config will help.

Summary

With Gretel Blueprints, you can use our APIs to generate synthetic time-series data in minutes. Sign up for a free account at console.gretel.cloud, load the notebook with a Jupyter runtime, and start generating synthetic time-series data today.

As an open beta release, we would love to hear how the DGAN model is working for you. Several improvements for DGAN are planned, including Console support, integration with Gretel Evaluate, better default configuration to address training instability, and more.