Presentation

Deep Dive on Synthetic Time-Series Data with Gretel.ai

June 15, 2022

The Gretel.ai team takes a deep dive into using #GANs to generate synthetic time-series data.

Transcription

Mason: Good afternoon, or evening, or morning from wherever you may be joining us, and welcome to another Gretel live event. My name is Mason Egger, and I am one of the developer advocates here at Gretel. Joining me today is Alex Watson. For those of you that have ever watched me stream, I can't point in reverse. So he's over here on my side-

Alex: You're right.

Mason: Our glorious CPO. And then we have Kendrick Boyd at the bottom, one of our principal machine learning scientists. So, it's great to have you two here today.

Mason: Just a quick rundown of events. Today, we're going to be talking about DoppelGANger, our new time-series synthetic data model. We'll have a couple of other people join us later, and we're going to have a live Q&A at the end. So if you have any questions during the presentation, drop them in the chat, and we will try to answer as many of them as we can, if not all of them, at the end of the presentation.

Mason: Thank you for being here. So great to have you here with our community and we're dedicated to learning in public. So hopefully you're here to learn with us and have a great time.

Mason: So I'm going to go ahead and I'm going to leave and turn it over. Alex and Kendrick are going to take you through the presentation. I'll see you all at the end, and Alex and Kendrick, take it away.

Alex: Thanks, Mason. I'm Alex Watson. I'm one of the co-founders here at Gretel, and I'm going to start with a quick overview of synthetic data and some of the use cases and pain points that we're addressing for customers. And then Kendrick will take us through an implementation of the DoppelGANger GAN, time-series data, and answer any technical questions that you might have around working with time-series data and synthetic data.

Alex: Let's start with the pain point that our customers, the users of our platform, are using synthetic data to address. It really centers around this bottleneck with data. There is an awesome survey that data science platform Kaggle puts out every year, and in recent surveys, one of the things that really stood out to us, at least, was that for most developers, whether you're a developer or a data scientist, that initial bootstrapping of data, just getting access to it or creating the data if it doesn't exist, can take up to 35% of your total project time, really making it one of the biggest bottlenecks that people face today when they're building new products.

Alex: Synthetic data is kind of an old idea. It's been around for a while, but advances in machine learning have made it possible for us to create realistic data sets that are generated by computer algorithms or models and are really a simulation of real-world data. And the benefits you have here can be increased privacy; it can be the ability to create data that doesn't exist by using models that are trained on large amounts of public data; or even addressing AI fairness and bias and things like that inside data sets, so really exciting use cases.

Alex: And then down at the bottom here, we talk about three of the biggest pain points we see customers going after. When they're using synthetic data, what are they using it for? Starting at the top here, we're talking about meeting compliance concerns: whether you're under GDPR or CCPA, the need to have anonymized data that doesn't point to a real human or one of your customers is really important. It's also traditionally very difficult to do. Synthetic data gives you a unique way to create a whole new data set that, if you look at it, looks just the same. You can query it, you can ask it questions, but it's not based on a real human's data. So this makes things like data sharing and access much easier.

Alex: Second, and this is a really neat one, we've even done a couple talks on it, is addressing bias in data sets. So often you have data, but just not enough of the right data. This could be a medical data set. We had a really neat example, working with a heart disease data set that was largely biased, unfortunately, toward male patients. And the users that were building with that data set were asking if it was possible to balance out that data, and what effect that would have on downstream machine learning use cases. So really exciting results we had there.

Alex: And finally, not having any data at all, which is I think a problem that we're all familiar with, and the idea of using pre-trained models that have been trained on large amounts of public data to compose a new data set that meets your requirements is a really neat and pretty recent advancement with synthetic data. Next slide.

Alex: We started with this already. So talking a little bit about what synthetic data is, and there's a pretty nice definition here. It's artificially annotated information generated by computer algorithms or simulations, used often as an alternative to real-world data. So when we say something is annotated, what does that mean? It means often that data is labeled. So you might have a text utterance pointing to a label for it, saying, "This is the sentiment associated with it," or things like that, which means that it's somewhat-structured data that you can reason about, because it gives you a little bit of context about the data. I'm going to let you take it from here, Kendrick. Go ahead. I think you're on mute, Kendrick.

Kendrick: That is correct. I had muted myself. I can't get through a Zoom meeting or a presentation these days without a muting issue. So within that framework of synthetic models that Alex was just talking about, Gretel has a number of different models that we use or are in development for different, specific tasks.

Kendrick: A quick overview of those here: we have the Gretel LSTM, which is our main workhorse model that generally works pretty well on lots and lots of different data types. We have a recently released model specifically dedicated to NLP, a model for text; that's the Gretel GPT. And then two upcoming models: one optimized for tabular data, CTGAN, that's upcoming. And then the one we're going to be talking about today is this DGAN or DoppelGANger model, and this is a dedicated time-series model.

Kendrick: So first off, a little bit of background. Why a dedicated time-series model? If you've followed Gretel's blog posts over the last year or so, we've had some existing blog posts talking about how to use the Gretel LSTM for time-series. But time-series data is everywhere these days in the world, from sensor data to medical records and everything in between. And with that really varied range of data, we want to be able to have different models for different data sets.

Kendrick: We want to provide options because different models excel in different areas; and overall, time-series data set synthesis is quite challenging, because you have to worry about correlations between variables, as you would in a standard tabular setup. If you're thinking about synthesizing price data from the stock market, tech stocks should generally move together, or other sectors should generally move together each day. But you also have correlations across time that you have to worry about, that your model needs to be able to produce accurately. So that could be weekly and yearly trends that might happen due to patterns of work and such in electricity usage, or many other data sets.

Kendrick: And so that's this little bit of background about why we want to have this dedicated time-series model. And the model that we've chosen to work with here is this DoppelGANger model. Here's a brief snapshot from the original paper, published in 2019 and titled, "Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions." It's a really great paper. We loved reading it and chose this paper for a couple different reasons. One, it's quite expressive: it's a very flexible model that can generalize to lots of different types of data, different data sets. And also, it's very good and effective at periodic data, data that has weekly or daily or hourly patterns to it. And that's a place where our current model, Gretel LSTM, can have some more difficulty. And so this is a great model to help fill in the gaps within our suite of models.

Kendrick: As part of this work, though, one of the things is that we've implemented a new PyTorch implementation of this model, and it's now open sourced as part of our Gretel Synthetics library. And so we can talk a little bit later, you'll have to ask some questions about why we did that and all that, but we wanted to have something in a little bit more modern setup than the TF1 code that was published with the paper. But the TF1 code was really helpful as a reference implementation that we can compare against.

Kendrick: So before we get into too many more details about how to use the model, we want to talk a little bit about, for the purposes of this talk and this particular model here, what do we mean by a time-series? What's the data that we're going to be working with? The synthetic data that the model is going to be producing? What does that look like?

Kendrick: So first of all, we might start off with just one variable. Say we have some smart home data: you might have a temperature sensor that's in your kitchen collecting a temperature over time. So here we have three days of data there. That's the first basic step. You have one sensor that you have data from.

Kendrick: And then you might have multiple sensors. You are collecting temperature and humidity, now, from both your kitchen and your bathroom. So we have a multivariate time-series here. But here, this is still just one home, one collection of sensors. But for these neural network models (GAN stands for Generative Adversarial Network), we need to have lots of different examples in order to effectively train this model.

Kendrick: And so then, you might imagine that you can obtain multiple samples by either looking at different houses, putting lots of sensors in different houses across a particular country or the world, or you could take a look at different weeks, different time ranges in order to produce different examples like this.

Alex: Kendrick, could you touch maybe on some of the difficulties around multi-variate time-series, like what we're looking at here versus a single variable?

Kendrick: Sure. Let's see... Okay, I do have my mouse here as a pointer. Looking across these three temperatures, there's a couple different things going on here. One, there are patterns that happen each day: it's generally warmer inside the house during the daytime. I don't know if the exact hours line up here because of time zones and such. This is data sampled from a house, actually, in Belgium. So there's daily, weekly patterns and things going on here. But then also, the individual variables need to be correlated, so that if there's a spike in... I think this is the humidity here in the bathroom, so maybe someone took a bath or a shower at this time, you also see a spike in the temperature, as well, the way that this system happens to work.

Kendrick: And if you see the full data set here that's available with this, there are sensors in a number of other rooms, and so you can also see that rooms that are nearby in the house, physically, those temperatures tend to be correlated more, those humidities tend to be correlated more. The bathroom might open into the living room, and so those temperatures should be correlated more so than the bathroom versus the garage, which might be further away. And so all of those dynamics are things that you're wanting to capture within your synthetic model, and be able to reproduce accurately with new synthetic data.

Kendrick: Then another aspect and another thing that we really liked about the DoppelGANger model is that it also supports, essentially, metadata on each of these time-series examples, so that you can annotate and say, "Oh, this particular data set here is from maybe the season, it's from the winter, and the particular country it's from, it's from Belgium." And then you might have other examples that are from the spring or from different countries, for example.

Kendrick: And so this allows you to represent in your data and for the model to learn that, "Oh, if it's winter in some country near the equator, for example, maybe temperatures aren't that much different from the summer. But if it's winter in Canada, then that should have much colder outdoor temperatures, and in general, even your indoor temperatures would probably be a little bit lower." And so that's a way for the model to capture those correlations and dependencies and learn about them.

Kendrick: All right. So with that... Oh, then finally, bringing that all together as a summary here. The types of data that DoppelGANger is going to be able to work with and train on need to have multiple examples. Now, certainly there are situations where you don't really have multiple examples naturally: you only have sensors in one house, or maybe you're looking at the history of something. There's one history of the full prices on the New York Stock Exchange, for example. So in those situations you can break up a long sequence into many smaller subsequences in order to have examples for the model to train with.
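A rough sketch of the splitting idea described above, assuming NumPy. The helper name `split_into_windows`, the window length, and the placeholder data are illustrative assumptions, not anything shown in the talk:

```python
import numpy as np


def split_into_windows(series: np.ndarray, window_len: int) -> np.ndarray:
    """Cut a (time, variables) array into non-overlapping windows.

    Returns an array of shape (n_windows, window_len, variables).
    Trailing rows that do not fill a whole window are dropped.
    """
    n_windows = series.shape[0] // window_len
    trimmed = series[: n_windows * window_len]
    return trimmed.reshape(n_windows, window_len, series.shape[1])


# Hypothetical long recording: 1,000 time steps of 2 variables.
long_series = np.arange(2000, dtype=float).reshape(1000, 2)

# Each window becomes one training example.
examples = split_into_windows(long_series, window_len=96)
print(examples.shape)  # (10, 96, 2)
```

Non-overlapping windows are only one option; overlapping (sliding) windows would give more examples, at the cost of those examples being highly similar to each other.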

Kendrick: And then the other components of this input data are these fixed variables, attributes that don't change over time. And so for the training setup, this looks like a two-dimensional array of data like a standard tabular data set, but then you also have associated, linked with that, these time variables or features, and these are the ones that change over time. That training input you can think of as a three-dimensional array where the first dimension is the different examples that you have, the second dimension is time, and the third dimension is your multivariate, the different variables that you've measured. And so throughout this talk and throughout the documentation and the paper, we refer to the fixed variables as attributes and the time variables as features. Those two arrays constitute the main input to the DoppelGANger model.
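To make the two input arrays concrete, here is a minimal NumPy sketch. The sizes, the integer codes for season and country, and the random placeholder values are all assumptions for illustration; only the 2-D attributes / 3-D features layout comes from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

n_examples, n_time_points, n_variables = 5, 144, 4

# Attributes: one row per example, fixed over time (2-D, like a table).
# Hypothetical integer codes: column 0 = season (0-3), column 1 = country (0-2).
attributes = np.column_stack(
    [
        rng.integers(0, 4, size=n_examples),
        rng.integers(0, 3, size=n_examples),
    ]
)

# Features: (example, time, variable), values that change over time (3-D).
features = rng.normal(loc=21.0, scale=1.5, size=(n_examples, n_time_points, n_variables))

# The two arrays are linked through their first dimension:
# row i of `attributes` describes time series i of `features`.
print(attributes.shape, features.shape)  # (5, 2) (5, 144, 4)
```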

Kendrick: So now we've talked a little bit about background: what types of data, what does the data look like for this model? We're going to jump into talking a bit about our PyTorch implementation. We've implemented this new version of this in PyTorch. It's available at the link there, and it can accept pandas DataFrames or NumPy arrays directly as the input. And we've also done some extensive testing of this that's just been recently... Today, actually... Published in a KDNuggets article, showing how this new implementation produces similar results to the original TF1 code. We've replicated several figures from the original paper in that article.

Kendrick: And another really nice aspect of this PyTorch version is that it's quite a bit faster for training. These training times here are on the data set from the original paper. It's a larger dataset, but the TF1 code took about 13 hours to train on an Nvidia T4, and with the PyTorch implementation, and tweaking the parameters to increase the batch size a bit, we can bring that training time down to about 20 minutes or so, so it's about a 40x speed up. That's really nice and able to improve how quickly you can iterate and try out different parameters, try out different setups for your data.

Kendrick: All right. And so with that, I'm going to jump into a little bit of a case study, work through what this might look like to set up this DoppelGANger model for a simple little data set here. The code for all of this is available in a Jupyter notebook that you can find at the grtl.ai/dgan-code link, if you would like to follow along here. I'll talk through this at a high level, and then I'm happy to answer any detailed questions that you'd like to add to the comments at the end.

Kendrick: So again, this is the same sensor data that I was talking about before. This is a data set that's up on the UCI repository, and it contains about four and a half months of observations from the one house in Belgium. These are sampled at 10-minute intervals, so there's really quite a bit of data here. And you can see at the top right there, there's a full view of the data. There's, I don't know, 30 or 40 columns here. There's lots of different measurements that were made both in the house and some stuff around the area.

Kendrick: And so for the purpose of this example, we're going to use four temperature columns: temperature from three different rooms within the house, and then another temperature that's measured from the nearby weather station, the outside temperature. And so on the bottom right, we can see those four temperatures graphed over the four and a half months. And that constitutes our main data.

Alex: Kendrick, when we go through this process and synthesize this data set, are we effectively creating another synthetic house and all the sensor readings for it? Or how do you think about that?

Kendrick: Yeah, that's a good question, and comes into this next step. Here we just have basically one example. In order to train the model, though, we're going to need to have multiple examples. And so the ways that we can do this, we could get sensor data from many other homes. That's going to be expensive, time consuming and so on; or we can split this four and a half months into smaller chunks. And so the choice that I've made here for this particular setup is to use one day of observations. That's 144 observations, 10 minutes each.

Kendrick: And so what the synthetic model is learning and what the synthetic model will produce at the end is one day of observations of these four temperature readings from this house. You could think of that either as generalizing to, "Oh, what might the temperature readings look like next year?" Or you could also think about it generalized to, "What might the temperature readings look like at a similar house that's nearby in the area?" And certainly if you collected other data from other homes or from other countries or things, you could think about it generally, being able to generalize more broadly.

Alex: This gets to the bootstrapping use case we were talking about earlier, where, let's say you're building an app to make a smart home type device or sensor, but you're limited by the initial data you have, you could use something like this to create additional examples for your algorithms?

Kendrick: Yes, that's correct.

Kendrick: Yeah. In this particular example here, then we have the attributes array that I was talking about before. There is no attributes array that we're using here, although you could certainly think about maybe adding something like that, maybe based on the time of year based on this data. And then for the features, we have this three-dimensional array, and there's 137 examples because we have data for 137 days. There's 144 time points for each example, and there's the four variables for the four temperature sensors that we are working with.
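The shape arithmetic above can be checked with a short NumPy sketch. The random readings stand in for the four temperature columns and are an assumption; the real notebook loads the UCI data instead:

```python
import numpy as np

MINUTES_PER_DAY = 24 * 60
SAMPLE_INTERVAL_MINUTES = 10

# 1,440 minutes per day at one reading every 10 minutes.
points_per_day = MINUTES_PER_DAY // SAMPLE_INTERVAL_MINUTES
print(points_per_day)  # 144

n_days, n_sensors = 137, 4

# Placeholder for the four temperature columns, one row per 10-minute reading.
rng = np.random.default_rng(42)
readings = rng.normal(20.0, 3.0, size=(n_days * points_per_day, n_sensors))

# One example per day: (examples, time points, variables).
features = readings.reshape(n_days, points_per_day, n_sensors)
print(features.shape)  # (137, 144, 4)
```

The reshape keeps readings in time order, so example `d` holds rows `d * 144` through `d * 144 + 143` of the original table.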

Kendrick: So with that data pre-processing finished out here, then we can go ahead and set up and train the model. There's the little snippet of the code that you can use to create the model and train the model on the right here. All the config details and such are documented at our documentation link, synthetics.docs.gretel.ai, and also within the Gretel synthetics repo code, as well.
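The slide's snippet is not reproduced in the transcript, so the following is only a sketch of what creating and training the model might look like. It assumes the module paths and method names from the gretel-synthetics README (`DGANConfig`, `DGAN`, `train_numpy`, `generate_numpy`); the specific values for `sample_len`, `batch_size`, and `epochs` are illustrative guesses, not the notebook's settings, and the documentation should be treated as the source of truth:

```python
import numpy as np


def dgan_settings(n_time_points: int) -> dict:
    """Illustrative settings only; the values are assumptions, not the notebook's."""
    sample_len = 12  # assumed; must divide the sequence length evenly
    if n_time_points % sample_len != 0:
        raise ValueError("sample_len must divide the sequence length evenly")
    return {
        "max_sequence_len": n_time_points,
        "sample_len": sample_len,
        "batch_size": 32,  # assumed
        "epochs": 100,  # assumed
    }


def train_and_generate(features: np.ndarray, n_samples: int) -> np.ndarray:
    """Train DGAN on (examples, time, variables) features and sample new sequences."""
    # Imported lazily so the settings helper works without gretel-synthetics installed.
    from gretel_synthetics.timeseries_dgan.config import DGANConfig
    from gretel_synthetics.timeseries_dgan.dgan import DGAN

    model = DGAN(DGANConfig(**dgan_settings(features.shape[1])))
    model.train_numpy(features=features)  # no attributes array in this example
    _, synthetic_features = model.generate_numpy(n_samples)
    return synthetic_features


settings = dgan_settings(144)
print(settings["max_sequence_len"], settings["sample_len"])  # 144 12
```

With the library installed, `train_and_generate(features, 10)` would be expected to return ten synthetic one-day sequences shaped like the training examples.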

Kendrick: In this particular case, because this is a pretty small data set, 137 examples and about 150 time points, it trains very quickly in 64 seconds. The other training set that I was talking about before had about 50,000 examples and 550 time points. In general, this DoppelGANger model... GAN models, I think, in general work best with more examples, so this is probably on the smaller end of where the model would really excel, but it's still a good example, and we can see what sort of data we get out of that.

Kendrick: All right, so now that we've trained the model, we can generate as many one-day snippets of temperatures as we would like to. That's the beauty of the GAN model, that you can pass in some noise and it gives back data that should look like the original data. So as a quick comparison here, the two plots on the right are two different examples that we have synthesized from our model. We can compare what that looks like compared to one of the original days that we're looking at here. There's some good things going on here, and there's some not-so-great things going on here. Some good things that are happening, the model's recognizing that the indoor temperatures are usually in 20 to 25 degrees Celsius, where people generally like their houses to be. And the outside temperature can vary; it's often quite a bit lower than that. And we also see that there are some changes in the outside temperature over the day. Not as many changes in the inside temperature, but still maybe sometimes seeing some, like this one room got a little bit colder during this day.

Kendrick: On the other hand, there's some other challenging things about this, as well, that the data doesn't look very smooth, and this is a little bit surprising because the input data was quite smooth. This is an aspect of something you could look into exploring, different ways of setting up the model, some of those different parameters that I mentioned that we have documented at synthetics.docs.gretel.ai, and possibly some different training with an appropriate learning rate might help with this. It's hard to trade that off when you have such a small data set to work with.

Kendrick: This is some visual inspection. We can also look, at a very high level, at a couple of more numeric-based evaluation approaches we could take. And we can talk more in detail about this, but at a high level, the left side here is looking at autocorrelation. If you look at one time point, how similar is one time point to 10 minutes before versus 20 minutes before, 30 minutes before and so on; and how does that correlation change over time?
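One common way to compute this kind of autocorrelation curve is sketched below, assuming NumPy. The helper name and the sine-wave test series are illustrative assumptions; the exact estimator used for the slide is not stated in the talk:

```python
import numpy as np


def autocorrelation(x: np.ndarray, max_lag: int) -> np.ndarray:
    """Pearson correlation between x[t] and x[t + lag] for lag = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    result = np.empty(max_lag + 1)
    result[0] = 1.0  # a series is perfectly correlated with itself
    for lag in range(1, max_lag + 1):
        result[lag] = np.corrcoef(x[:-lag], x[lag:])[0, 1]
    return result


# Hypothetical series: a clean daily cycle sampled every 10 minutes for 7 days.
t = np.arange(7 * 144)
series = 21.0 + 2.0 * np.sin(2 * np.pi * t / 144)

acf = autocorrelation(series, max_lag=144)
# Lag 72 is half a day (opposite phase); lag 144 is a full day (same phase).
print(round(acf[72], 3), round(acf[144], 3))
```

Computing the same curve for a real column and its synthetic counterpart, then plotting the two together, gives the kind of real-versus-synthetic comparison described here.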

Kendrick: In general, you want the real and synthetic lines here to be similar. For one of the temperatures, I think the temperature in the living room here, it did a pretty good job here with maybe tailing off with the long time ranges, not having quite as great a correlation, but then for another temp sensor that we had, the model didn't do particularly well. And so again, that's something that you could try out some different parameter settings to help improve this, or finding some more sensor data so that the model has more data to train from.

Kendrick: And then on the right hand side, this is a simple thing of looking at how well these values change together. Do T1, T2, and T3, the temperatures from inside the house... Generally are those close together? Do those correlate more? Whereas the temperature outside is further away. Here we have a correlation matrix for these four variables, and so in general, you want these values to be similar between what you had from the real data and what you had from the synthetic data. And these numbers are reasonably close, although there's certainly some room where perhaps a better model could improve.
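A minimal sketch of this correlation-matrix comparison, assuming NumPy. The helper name is hypothetical, and the "real" and "synthetic" arrays are random placeholders rather than actual model output:

```python
import numpy as np


def variable_correlations(features: np.ndarray) -> np.ndarray:
    """Correlation matrix between variables, pooling all examples and time points."""
    n_variables = features.shape[2]
    flat = features.reshape(-1, n_variables)  # (examples * time, variables)
    return np.corrcoef(flat, rowvar=False)


rng = np.random.default_rng(1)

# Placeholder arrays shaped (examples, time, variables).
real = rng.normal(size=(137, 144, 4))
synthetic = rng.normal(size=(50, 144, 4))

# Make the first two variables move together in both arrays.
real[:, :, 1] = real[:, :, 0] + 0.1 * real[:, :, 1]
synthetic[:, :, 1] = synthetic[:, :, 0] + 0.1 * synthetic[:, :, 1]

real_corr = variable_correlations(real)
synthetic_corr = variable_correlations(synthetic)

# Largest element-wise disagreement between the two matrices; smaller is better.
gap = np.abs(real_corr - synthetic_corr).max()
print(real_corr.shape, gap < 0.2)
```

A single summary number like `gap` is a convenience chosen here; looking at the two matrices side by side, as on the slide, shows which variable pairs are off.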

Kendrick: That's a high-level look at how you might apply this DGAN model to some sensor data. And with that, coming to the end of the presentation here, what are some of the next steps here for time-series at Gretel? First off, we have some further improvements for the open-source model that are in progress here. We'd like to add variable-length time series support to that. This was described in the original paper and it's part of the TF1 code, but in getting an initial version out, we had skipped that support. We can talk in more detail about how exactly that happens, but we'd like to add that to it. And then there's also some other PRs that are in progress that should show up in the GitHub repo probably in the next week or two to allow more robust, more flexible support for different input types and data frames that you can pass in.

Kendrick: And then finally, we're working towards getting DoppelGANger available as something that you can use through the Gretel SDK and through the console that you can use the Gretel APIs with all the various features around that and running these models in the Gretel cloud. Look for an announcement for that probably in the next month or so.

Kendrick: But right now you can use the open source code yourself, and you can start out from these two example notebooks: one, the home sensor data that I walked through today, or the Wikipedia page view data that was from the original paper. And hopefully we can get some links to those in the chat if you'd like to go check those out.

Kendrick: And so with that, I think we're ready to switch over to the question section. I'm happy to talk about questions about this particular data set, about the model, about the PyTorch implementation, or just generically about Gretel. And I think we have a couple other people who'll be joining us for the Q and A session, so there may be a few moments as we're getting situated here.

Mason: Yes, we are back. Well, that was a great presentation. Thank you for that, Kendrick. Now we're moving into the Q and A session. We have some predefined questions that I'm going to let Lipika ask and discuss with Kendrick and Alex, but also if you have any questions that you want, please drop them in the comments, either on YouTube or LinkedIn. We'll definitely try to get to all your questions today. So I think what we're going to do is I'm going to bring up the question, and then Lipika, I'll let you take it from there. I'll bring it up on the little shashasha thing.

Lipika: Awesome. Thanks, Mason. Hey everyone, I'm Lipika, I'm a machine learning scientist here, yet another one, and we'll jump into questions. When should you use DoppelGANger versus Gretel Synthetics? I imagine by Gretel Synthetics, we mean the LSTM model, our workhorse model. I can provide a one-liner, which is when you have data that's collected across many different series, so if you had data from houses all across Belgium, Egypt, Canada, the US, and you had a lot of data collected from all these houses and multiple series, DoppelGANger would be a very good bet because that's a lot of underlying distributions to learn, and DoppelGANger does that well in the time-series realm. Whereas if you simply had, I don't know, information on stock prices for one or two stocks, using something like Gretel Synthetics... The LSTM in Gretel Synthetics is a good bet because there is some upward trend, downward trend, long term, those types of fluctuations that Gretel Synthetics can handle really well. And so the underlying model is an LSTM, which is a recurrent [inaudible 00:29:52], so that handles that type of thing really well. Any other examples you all can think of, Kendrick, Alex?

Alex: Maybe for a couple examples that I've seen as we've been building out the time series work here, and then financial markets that Kendrick alluded to earlier, like the open, high, low, close data, we have so much financial information, but there's only one New York Stock Exchange. So one of the things that we see often with financial customers is they're building models to predict market changes like the 2008 housing crash or the GameStop thing that happened earlier this year, or even like the Bitcoin stuff that's happening right now.

Alex: This is a case where you can train synthetic models to create additional examples that will help your machine learning to generalize a little bit better. That's a pretty exciting one. We also see it quite often with medical devices where it's so expensive to test with patients and things like that, but you want to build a machine learning algorithm once again to detect some sort of medical condition or something like that. So this is another really useful area where you can use DoppelGANger like this to consume the input of the sensors and create another data set.

Kendrick: Yeah. I guess to add one other example that might be a better fit for Gretel Synthetics, the Gretel LSTM, right now: it's anything where you have more free text or higher-cardinality discrete features. DoppelGANger is very focused on, and very effective for, numeric data or discrete categorical data that has a relatively small number of different values, like maybe 20 or 50 values. But for something where you want to have a bit more flexibility, such as a free-form text field that might be associated with medical records or something, it's possible that the Gretel Synthetics model might do better for that.

Alex: Last thing... Go ahead, Lipika.

Lipika: Ah, go on. I was going to say that the one thing they both do really well is handle a data set pretty much as you provide it. So you don't need to encode features, you don't need to do sort of all that plumbing to get things to work with a model that's very low cardinality, categorical, and numeric-feature first. DoppelGANger, you can just pass it in and all the plumbing is in there, so you don't have to worry about it, just pass in a Pandas Dataframe and call it a day and see what comes out. Not to say that you might not run into errors, but just to say that we've handled a lot of things.

Alex: That's what I was going to suggest, is try both. Why not? See which one works better for your use case.

Lipika: All right. I think we're ready to move to the next question. Are there particular types of time series datasets that DGAN does exceptionally well on? I will open the floor as I think about this.

Alex: From what I've seen, maybe talking about this, Kendrick alluded to the multivariate nature of time series. What that means is you have multiple inputs or sensors, or stock prices, that are all correlated to each other. And in our testing, that's where the DoppelGANger approach worked really well: learning to recreate that correlation, so that when your outside temperature heats up, your air conditioning would flip on, things like that across these different sensors. That's one thing that it does super well.

Lipika: Yeah. I've also found that the shorter the series, the better DGAN does, because implicitly, if you're trying to learn a longer time series, let's say an entire year's worth of data versus just a week's worth of data or a month's worth of data and the patterns of that, anyone would be able to do that. Even you and I would be able to do that much better for shorter time periods. And so that's a natural occurrence in such models.

Kendrick: Another place where I think it really excels is when you have sensor readings or other readings where all the values fall within a particular range. So ECGs or temperatures, that sort of thing: the way that the DoppelGANger model is structured, it handles that very well, whereas the Gretel LSTM or other models might be better for things like prices or population or something that might just sort of continue growing without bound, depending on what the inflation is doing lately.
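One plausible reading of why bounded readings suit the model, sketched under the assumption that continuous values are min-max scaled into a fixed range using the training data's minimum and maximum. The helper name and the numbers are illustrative; this is not a statement of how the library scales internally:

```python
import numpy as np


def scale_to_unit_range(values: np.ndarray, low: float, high: float) -> np.ndarray:
    """Min-max scale values using a range fixed at training time."""
    return (values - low) / (high - low)


# Hypothetical indoor temperatures seen during training (degrees Celsius).
train_temps = np.array([18.0, 20.5, 23.0, 25.0])
low, high = train_temps.min(), train_temps.max()

# Training values land inside [0, 1].
scaled = scale_to_unit_range(train_temps, low, high)
print(scaled.min(), scaled.max())  # 0.0 1.0

# A value that keeps growing past the training range (as a price might)
# falls outside [0, 1], where a generator with a bounded output cannot reach it.
later_value = scale_to_unit_range(np.array([40.0]), low, high)
print(later_value[0] > 1.0)  # True
```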

Lipika: Oh, and sorry. One of the things that we totally forgot to mention was obviously if you have data that has attributes or things that don't change with time, DoppelGANger is definitely the best bet with that. Especially when you have these other features like multivariate series or longer series, more complications, things like that. Certainly if you have attributes and many of them, DoppelGANger does quite well, and actually does incredibly well, went to it correctly, in maintaining the distribution of the attributes between the original and synthetic data set. That's a really neat feature.

Lipika: All right. I think we can move to the next question. "Does Gretel have a library of produced synthetic data that people can look at?" That's a great question. We have a lot of examples on our Blueprints GitHub repo, and we publish a lot of blogs where we use our models on different types of data sets, like Alex had mentioned. The UCI heart disease data set... Kendrick of course used the Wikipedia web traffic data set in his latest blog. And we keep publishing new data sets on Blueprints and in our blogs. That's what I can think of. Alex?

Alex: Yeah. Most of our examples, I think as you highlighted, require you to generate the data set yourself, but you could go to docs.gretel.ai and click on the synthetic data notebooks, and it will walk through creating... We've probably got 25 different examples for creating different kinds of data.

Lipika: Awesome. Thanks for that question. "How are time series of varying length handled?" Kendrick, you talked about this a bit earlier. Would you like to discuss this in detail?

Kendrick: Sure. Yeah. With time series of varying lengths, there's still going to be a maximum length that the model can produce. And then effectively, we add an additional Boolean feature over time that is one until the end of that particular time sequence, and then it turns to zero and is zero the rest of the time. And we just feed that into the model and let the model treat it as just an additional feature. That works surprisingly well.

Kendrick: The original paper had a couple of different data sets with these varying lengths, and the DoppelGANger model did better than any of the other models they compared against at replicating the actual distribution of the lengths of the sequences that come out of that. And so that's something that we are looking to implement and add to our open-source PyTorch version in the next month or so, I think.

Lipika: That's awesome. Yeah. That's basically an additional time series. You're extending your multivariate series by one more and having it modeled on that, as well.

Kendrick: Right. You have an extra sensor, an extra measurement that is itself a time series: is this particular series still active, if you will?
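The "still active" flag that Kendrick and Lipika describe can be sketched as follows: every sequence is padded to a shared maximum length, and one extra feature is 1 while the sequence is running and 0 afterwards. This is a minimal illustration with hypothetical helper names. As noted in the discussion, this feature was not yet part of the open-source implementation at the time of the talk.

```python
def pad_with_active_flag(sequences, max_len, pad_value=0.0):
    """Pad each sequence to max_len and append a 0/1 'active' flag per time step.

    Each sequence is a non-empty list of time steps; each time step is a list
    of feature values. The flag is 1.0 for real steps and 0.0 for padding.
    """
    padded = []
    for seq in sequences:
        if len(seq) > max_len:
            raise ValueError("sequence longer than max_len")
        n_features = len(seq[0])
        rows = [list(step) + [1.0] for step in seq]
        padding = max_len - len(seq)
        rows += [[pad_value] * n_features + [0.0] for _ in range(padding)]
        padded.append(rows)
    return padded


def sequence_length(rows):
    """Recover a sequence's length by counting its active time steps."""
    return sum(1 for step in rows if step[-1] >= 0.5)


# Two hypothetical single-feature series of different lengths.
sequences = [
    [[0.2], [0.4], [0.6]],
    [[0.9], [0.8], [0.7], [0.6], [0.5]],
]
batch = pad_with_active_flag(sequences, max_len=5)
lengths = [sequence_length(rows) for rows in batch]
print(lengths)
```

On generated data, the same flag channel would be read back to decide where each synthetic sequence ends.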

Lipika: And that raises another point, which is that if you only had multivariate series that are Boolean, for example, you could model that with DoppelGANger, as well.

Kendrick: Right. The model does support discrete variables that are part of the time series.

Lipika: Yeah. For example, "Did you test positive or negative for COVID?" And we collected that information from everybody here across the many months since March 2020. All right. We can answer the next question now, I think. "Why did you choose PyTorch versus TensorFlow 2 to implement DoppelGANger?"

Alex: Insert loaded question.

Kendrick: I guess I can take this. I think there are lots and lots of great tools and a great, huge ecosystem around PyTorch in the same way that there is around TensorFlow. And I think we were really interested in learning more about the PyTorch ecosystem and really building up Gretel's knowledge and experience so that we are familiar with both PyTorch and TensorFlow. The other models right now that are in our Gretel synthetics library are in TensorFlow, but now we have experience with both PyTorch and TensorFlow. And I think going forward, for whatever particular data set or task comes up, we can utilize the best models, the best tools, from either TensorFlow 2 or PyTorch and not be forced into choosing only from one. And it was also just fun to jump in and learn a new way of thinking about neural networks and implementing them.

Lipika: And a new way of dealing with CUDA, as well, on the GPU.

Kendrick: Yes. And trying to make sure that everything, all your Python stuff, works properly with CUDA.

Lipika: Yeah. Okay. I think we can move to the next question, if we have any. Ah. "What does the input to DGAN look like?" I think Kendrick had shown a little snippet of a data frame that was passed in.

Kendrick: Let's see here. So there's this... Is this the snippet you were thinking about?

Lipika: Oh, no. I was thinking about the one with just the raw data set.

Kendrick: Oh, back here.

Lipika: Yeah. In that data set, you can see time going down and measurements going across, so the different variables going across. The input to DGAN, Kendrick, doesn't like 2D and 3D. There are a lot of things that need to happen, but one of them is that if you don't have any attributes and you're just looking at features and you're interested in, for example, just T1 or temperature in whichever room, you can basically... And if you had multiple of these series for different homes, you could basically flip this for each of those homes and that would comprise your input to DoppelGANger.

Lipika: Now, Kendrick did do some sorcery here to get just this one series to look like multiple time series so we had more examples, and you can see that in the notebook that's linked here in the public research repo. I think everyone should have a link to that, as well.
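The transcript does not show the notebook's actual processing, but one plausible way to turn a single long recording into many training examples is to cut it into fixed-length windows. The sketch below assumes non-overlapping windows and uses a hypothetical helper name; the result is a nested list laid out as (examples, time steps, features).

```python
def split_into_examples(series, window):
    """Cut one long series into non-overlapping, fixed-length examples.

    series is a list of time steps, each a list of feature values. Trailing
    time steps that do not fill a whole window are dropped.
    """
    n_examples = len(series) // window
    return [series[i * window:(i + 1) * window] for i in range(n_examples)]


# Hypothetical recording: 10 readings of (temperature, humidity).
long_series = [[20.0 + t, 40.0 - t] for t in range(10)]

examples = split_into_examples(long_series, window=4)
shape = (len(examples), len(examples[0]), len(examples[0][0]))
print(shape)
```

Overlapping windows would give more examples from the same recording, at the cost of examples that are no longer independent of each other.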

Kendrick: Yeah. And I think for common use cases like that, some of that data processing, we are looking at how to generalize that and include that directly in the interface, if you're doing something that's a typical transformation like that. But we're still iterating a little bit on exactly what that should look like, and if you have thoughts about what would be most useful, most natural for you as an interface, we'd love to hear about that either on GitHub as an issue, or you can join our Slack community and chat there.

Lipika: Cool. Yeah. And one more thing to mention is that if you look at Kendrick's recently published blog on the Wikipedia web traffic data set, the input to the model for that data set is exactly as it's provided on Kaggle, or maybe with a few minor tweaks, but it's basically a data set that already has time going across the top or something like that. Or maybe I misspoke, but if you have time going across the top and you have multiple examples going down, that's perfect. You can just pass that in as a data frame and see what we do with that.
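The "time going across the top, examples going down" layout can be illustrated with a tiny wide-format table. The sketch below uses only the standard library and made-up page names and values. It shows how each row becomes one example with a single feature per time step, and is not the loading code from the blog post.

```python
import csv
import io

# Hypothetical wide-format data: one row per example, one column per date.
wide_csv = """page,2022-01-01,2022-01-02,2022-01-03
Page_A,10,12,9
Page_B,200,180,210
"""

reader = csv.reader(io.StringIO(wide_csv))
header = next(reader)
time_columns = header[1:]  # the dates running across the top

names = []
features = []  # laid out as (examples, time steps, features)
for row in reader:
    names.append(row[0])
    features.append([[float(value)] for value in row[1:]])

shape = (len(features), len(features[0]), len(features[0][0]))
print(names, shape)
```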

Lipika: Cool. Okay. I think we can move. "How long does training typically take?" There was the comparison that Kendrick had of 0.3 hours when we changed the configuration a little bit so it runs faster. I think there we increased the batch size, is that right?

Kendrick: Yes, that's correct.

Lipika: Okay. And so if we didn't increase the batch size and we used... What is it? This batch size was a thousand, the previous one was... No, this batch size would've been just the size of the data set.

Kendrick: For the home sensor data, yeah. The batch size was the entire data set. And it's possible that's why the model isn't doing quite as well, because we're not really doing stochastic gradient descent. We're not getting the stochasticity of that. But that's another parameter that we might tweak that could help provide better results.
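Kendrick's point about losing stochasticity can be seen with simple arithmetic: when the batch size equals the data set size, each epoch performs exactly one gradient update on the full data. The data set size below is a hypothetical number chosen for illustration.

```python
import math


def updates_per_epoch(n_examples, batch_size):
    """Number of gradient updates in one pass over the data."""
    return math.ceil(n_examples / batch_size)


n_examples = 1000  # hypothetical data set size
epochs = 2000      # epoch count mentioned in the talk

for batch_size in (1000, 100, 10):
    per_epoch = updates_per_epoch(n_examples, batch_size)
    print(f"batch_size={batch_size}: {per_epoch} updates/epoch, "
          f"{per_epoch * epochs} updates total")
```

Smaller batches give more, noisier updates per epoch, which is the stochasticity Kendrick refers to. Whether that improves results on a given data set is something to test empirically.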

Lipika: Yeah. That makes sense, I think. I recall you said that training 2,000 epochs was less than a minute on a T4 for this particular data set, but then I think the times that are listed here are for the Wikipedia data set that was of size, I want to say, 50,000, and each series had 500 time points.

Lipika: Cool. Yeah. The best way to figure out how long training something takes is to train it, obviously, but we'd highly recommend using a GPU for this. It really reduces training time. In fact, I just took Kendrick's notebook and moved it to Colab. I ran it, then changed the runtime to GPU and ran it again, and it took maybe a minute or two, depending on what type of GPU you get allocated when you start that up. But it was very quick for this data set, which is super small and purely for demonstration.

Lipika: Cool. All right. Do we have another question? Ah. "Does DoppelGANger only work for continuous time series data?" I think we answered that during the rest of the questions, but the simple answer is, it works for discrete data as well. If you have a categorical variable with limited cardinality or small cardinality, rather, or if you have a Boolean variable, DoppelGANger works for that, as well. You're on mute, Mason.
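For a categorical variable with small cardinality, a common representation is one-hot encoding: each category becomes its own 0/1 channel. The sketch below assumes that representation purely for illustration. The talk does not describe how DoppelGANger encodes discrete variables internally, and the helper names are hypothetical.

```python
def one_hot_encode(values):
    """Return the sorted category list and a one-hot row for each value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0.0] * len(categories)
        row[index[value]] = 1.0
        encoded.append(row)
    return categories, encoded


def one_hot_decode(categories, encoded):
    """Map each row back to the category with the largest score."""
    return [
        categories[max(range(len(row)), key=row.__getitem__)]
        for row in encoded
    ]


# Hypothetical discrete series: one test result per time step.
test_results = ["negative", "negative", "positive", "negative"]

categories, encoded = one_hot_encode(test_results)
decoded = one_hot_decode(categories, encoded)
print(categories, encoded[2], decoded)
```

Decoding by largest score also works on soft model outputs such as `[0.3, 0.7]`. A Boolean variable is the two-category case of the same idea.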

Mason: Welcome to the world of mute. Yes. I think that's all the questions we have today. Awesome. That was a lot of fun. Really enjoyed that. For those of you that are here and made it all the way to the end or anyone who's watching this video after the fact, and you want to get some swag, you can go to the link on the screen and I'll also drop a link in the comments. It's grtl.ai. It's a short little cute Bitly link, /dgan/live. And we will send you a Gretel swag pack with a bunch of the stickers that we're currently having made. If you made it all the way this far, get some stickers. They're going to be fun stickers. You're going to like these stickers. Been spending a lot of time working on them. They are going to be... I'm not going to say they're going to be the best stickers ever, but they're going to be close. I like stickers. My laptop looks like I'm a three-year-old with stickers. So... Awesome. Well, does anyone have any final remarks or anything you want to say before we wrap it up?

Alex: Nope. It was a fun discussion. Love the questions. And maybe as two quick followups, one, we're going to have more examples, and some rooted in popular use cases we're seeing today, like financial data, coming soon. And second thing is if you have any questions, join our community Slack. We'll drop a link below and you can ask us questions there.

Mason: Fantastic. Well, thank you so much, Alex, Kendrick and Lipika for giving us a great presentation today on synthetic time series data with DoppelGANger. Keep your eyes open for more live events coming. If you're not already subscribed to our YouTube channel, go ahead and subscribe. We're going to start doing these technical presentations, deep dives, tech talks, AMAs on a much more frequent basis. So if you're interested in learning more from us, feel free to come back, and we hope that we'll see you next time.

Kendrick: Thank you.

Mason: Bye everyone.

