Creating synthetic time series data

A step-by-step guide to creating high quality synthetic time-series datasets with Python.

TL;DR

In this post, we will create synthetic versions of a time-series dataset using Gretel.ai’s synthetic data library, visualize and analyze the results, and discuss several use cases for synthetic time series data.

Background

One of the biggest bottlenecks that we face as data scientists is the challenge of not having enough data. Companies like Amazon have turned to synthetic data to generate the large amounts of training data required to support new languages for Alexa, researchers are experimenting with GANs to generate diverse synthetic images for medical research, and companies like Unity3D are applying their expertise in gaming environments and 3D assets to help you train models that can better perceive objects in the real world.

Whether we need historical data to test our algorithms, or more data to build ML models that generalize better, the use of synthetic data to augment limited datasets is growing rapidly. One area that has been particularly challenging for creating realistic synthetic data is time-series data, due to the need to maintain strict ordering and the often complex relationships between timestamps and measurement values.

Let’s start with a quick definition of time series data: a series of measurements or observations indexed in time order.

Getting started

Click here to view the code below on GitHub, or here to launch the interactive notebook in Google Colaboratory. You will want a box with a GPU for model training, else grab a cup of ☕. To get started running the example code, grab an API key (it’s free) from the Gretel console.

The training dataset

For our training dataset, we will create a time series with measurements at hourly intervals. For the Y axis, we want a measurement function that is complex enough to test our synthetic model, but simple enough to enable quick visual analysis. For this, we can apply a sine wave function that repeats on a yearly basis.
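Here is a minimal sketch of how such a dataset can be built with Pandas and NumPy; the column names (date, value) and the three-year range are illustrative choices rather than requirements.

```python
import numpy as np
import pandas as pd

# Seconds in one (average) year, used to make the sine wave repeat yearly.
seconds_per_year = 365.2425 * 24 * 60 * 60

# Hourly timestamps over three years (illustrative range).
dates = pd.date_range(start="2018-01-01", end="2021-01-01", freq="H")
epoch_seconds = dates.astype("int64") / 1e9  # nanoseconds since epoch -> seconds

train_df = pd.DataFrame({
    "date": dates,
    "value": np.round(np.sin(epoch_seconds * (2 * np.pi / seconds_per_year)), 4),
})

print(train_df.head())
```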

Great, now we have our time series DataFrame. Hint: to experiment with different time intervals (e.g. hourly, per minute), try changing the freq parameter in Pandas’ date_range() function above. Now, let’s visualize the time series!
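One quick way to plot it, shown here with Plotly as the Pandas plotting backend (the default matplotlib backend works just as well):

```python
# Use Plotly as the Pandas plotting backend (requires plotly >= 4.8).
pd.options.plotting.backend = "plotly"

fig = train_df.plot(x="date", y="value", title="Hourly sine wave training set")
fig.show()
```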

Plotting the time series training set

Extract trend data

Next, we create the training set for our synthetic model. We could train directly on the time-series data, but we would just be training the model to memorize the measurements at a given time. A more intuitive approach is to train the model on trend data, or the amount that the measurements change over a time interval. The model then re-creates the time series trend data, which can be restored to the original format using a cumulative sum.
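A minimal sketch of the trend-extraction step, continuing from the hypothetical train_df above:

```python
# Replace the raw measurement with its first difference (the "trend").
# The first diff is NaN, so fill it with 0.0; keep the starting value so the
# original series can be recovered later with a cumulative sum.
start_value = train_df["value"].iloc[0]

train_trend = train_df[["date"]].copy()
train_trend["trend"] = train_df["value"].diff().fillna(0.0)

print(train_trend.head())
```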

Train a model

Next, we will use gretel_synthetics and TensorFlow to train a synthetic model on our training dataset. To ensure that the generated synthetic datasets match the time windows of our source dataset, we specify the date field as a seed value that can be provided at generation time.
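A rough sketch of the training step using gretel_synthetics’ DataFrameBatch interface; the configuration keys shown are illustrative, and the exact options (and sensible values) vary between library versions, so consult the docs for the release you have installed.

```python
from pathlib import Path

from gretel_synthetics.batch import DataFrameBatch

# Illustrative configuration template; the exact set of options and sensible
# values depend on the gretel_synthetics version you have installed.
config_template = {
    "epochs": 100,
    "vocab_size": 20000,
    "field_delimiter": ",",
    "overwrite": True,
    "checkpoint_dir": str(Path.cwd() / "checkpoints"),
}

# Train a synthetic model over the date + trend columns.
batcher = DataFrameBatch(df=train_trend, config=config_template)
batcher.create_training_data()
batcher.train_all_batches()
```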

Generate synthetic datasets

We can now use the model to generate any number of synthetic datasets. To match the time range of the original dataset, we’ll use Gretel’s seed_fields option, which allows you to pass in data to use as a prefix for each generated row. The code below creates 5 new datasets and restores the cumulative sum from the trend data to match the original dataset.
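A sketch of the generation loop and the cumulative-sum restoration; the way seed values are passed (shown here as a seed_fields argument to generate_all_batch_lines) has changed across gretel_synthetics releases, so treat the call below as an assumption to verify against your version.

```python
# Seed each generated row with the original dates so the synthetic data covers
# the same time window. The seed_fields argument is a best guess at the current
# API; check the docs for the gretel_synthetics version you have installed.
seed_dates = [{"date": str(d)} for d in train_trend["date"]]

synthetic_dfs = []
for _ in range(5):
    batcher.generate_all_batch_lines(seed_fields=seed_dates)
    synth_df = batcher.batches_to_df()

    # Undo the diff: cumulative sum of the trend plus the known starting value.
    synth_df["value"] = start_value + synth_df["trend"].astype(float).cumsum()
    synthetic_dfs.append(synth_df)
```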

Visualize results

Finally, let’s visualize the 5 synthetic datasets we created vs. the original training set to get a feel for how the synthetic time series data compares against the source data. Fortunately, Pandas DataFrame integration with Plot.ly’s graphing libraries makes this easy!
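One way to overlay the five synthetic datasets on the source data with Plotly, reusing the hypothetical train_df and synthetic_dfs from above:

```python
import plotly.graph_objects as go

# Overlay the original training set and the 5 synthetic datasets.
fig = go.Figure()
fig.add_trace(go.Scatter(x=train_df["date"], y=train_df["value"],
                         name="original", line=dict(width=3)))

for i, synth_df in enumerate(synthetic_dfs):
    fig.add_trace(go.Scatter(x=pd.to_datetime(synth_df["date"]),
                             y=synth_df["value"],
                             name=f"synthetic_{i}", opacity=0.5))

fig.update_layout(title="Synthetic time series datasets vs. the original training set")
fig.show()
```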

Synthetic time series datasets vs. the original training set

As we can see, the synthetic model did a good job of learning the sine function and the temporal relationships in our source dataset. To take this example a step further, try adding new variables and sine waves to synthesize, or try changing neural network configuration parameters such as training epochs, learning rate, and generation temperature to experiment with different results. We’d love to hear what you try, and stay tuned for more complex examples in a future post!

Conclusion

At Gretel.ai we are super excited about the possibility of using synthetic data to create ML and AI models that are both ethically fair and generalize better against unknown data. We’d love to hear about your use cases: feel free to reach out for a more in-depth discussion in the comments, on Twitter, or at hi@gretel.ai. Follow us to keep up with the latest trends in synthetic data!

Interested in training on your own dataset? Gretel-synthetics is free and open source, and you can start experimenting in seconds via Colaboratory.