Generating Synthetic Time Series Data with Gretel DGAN
Video description
Deep dive on generating synthetic time series data using Gretel's Doppelganger (DGAN) model
Read the blog post
Transcription
Speaker 1 (00:03):
Welcome everyone. Today we are having another Gretel workshop and we're going to be doing a pretty fun topic that people seem to be really excited about, which is generating synthetic time series data using Gretel's new model, DoppelGANger, or DGAN. I feel like we refer to it in many different ways, but you'll see DGAN in our documentation, you'll see it in the config and stuff. So, okay, what is DoppelGANger, Kendrick? I'll pull up the paper, but yeah, do you want to just give a high-level overview of what it is and what makes it different?
Speaker 2 (01:02):
So DoppelGANger is a model, it's a GAN, so it's a generative adversarial network based model. So there are two networks that are working both adversarially and together to effectively generate data here. This was a paper published a few years ago that introduced this DoppelGANger model that's really effective for time series. They looked at a number of existing models for synthetic time series generation and added a couple of specific improvements. I think there were four or five specific improvements that they added to this GAN to help make it better for generating good quality synthetic time series. In general, the way I think about it, and the way we've tried to use it in our documentation and stuff, is that DoppelGANger is the conceptual model and DGAN is Gretel's implementation of that. We have a PyTorch implementation that's open sourced, and then, as of about a week or two ago, it's also available through our APIs. So you can use the Gretel Cloud runners to run this DGAN model to train on your particular time series data and then generate more data that looks like it.
Speaker 1 (02:19):
Awesome. Okay, that's really helpful. So DoppelGANger is the original model, the original conceptual paper, and DGAN is Gretel's implementation of that model.
Speaker 2 (02:27):
Yes, I think that's a good distinction and way to think about it. Yeah,
Speaker 1 (02:31):
Good, easy, simple and easy to remember. What makes synthetic time series data more difficult? Why did there need to be a specific model around time series data versus the tabular data that we currently have through our LSTM?
Speaker 2 (02:44):
Right, right. That's a great question. So with time series data, in addition to having many variables, you might think about sensors in a house, internet of things type sensor data that you're collecting. So you have many sensors in the house that should have correlations between them. At the top level, the fact that the temperature in the kitchen shouldn't be that much different than the temperature in the living room is like your standard tabular data set. So you need these between-variable correlations, which is like tabular data, which we've done previous workshops on here at Gretel. But then with time series data, you also have this time axis aspect: not only are you measuring these temperatures at one particular time point at noon today, you also measured at 1:00 PM today, at 2:00 PM, and so on. And so you also need to make sure that across time you have the right correlations and changes in your data, that you don't expect the temperature to go from 50 degrees Fahrenheit to 80 degrees Fahrenheit over the course of an hour or something. Right?
Speaker 1 (03:58):
Yeah,
Speaker 2 (03:59):
That makes sense. You have both of these dimensions that you are trying to capture in a model so that you can replicate them in your synthetic data. And sometimes the correlations you might be interested in are very long. One of the data sets that was used in the original paper is this data set about Wikipedia page views, and they were looking at time sequences that were over a year long. So you needed to make sure that certain pages get more popular around certain times of the year, and so you could have these yearly correlations that the model needed to learn and effectively replicate in the synthetic data.
Speaker 1 (04:37):
Awesome. Interesting. That sounds really cool. I always learn a lot from doing these. I love just chatting with people about it. So everyone who's here, if you're in chat, go ahead and say hi. Let us know you're here, let us know where you're coming from. If you have any questions that you want to ask throughout the presentation, feel free to drop them in chat and I will pick them up and ask them as we find it appropriate. As always, we're training models in here, so we do usually have a little bit of downtime as we're waiting on the models to train. So if you ask your questions, they're usually going to be answered in that section; it's a great time to ask. But without any further ado, let's go ahead and generate some synthetic time series data. So this synthetic time series notebook generating oil prices comes directly from a blog that Kendrick wrote here, which is I think our most recent blog, generating time series data with Gretel's new DGAN model. So Kendrick, do you want to tell us a little bit about the data? It looks like it's oil price history. As someone who lives in Texas, I know what WTI is,
Speaker 2 (05:36):
But yeah, maybe you're the better person to introduce this. So we were looking around for some time series data sets that would be interesting to try out with the model and to give as examples. One common, general type of time series data is prices, financial prices for stock markets, interest rates, et cetera. And so this was an interesting one that wasn't the standard stock market. This is looking at daily prices going all the way back to 1987 of the
(06:12)
Price. Well, it's sort of a reference price for a particular type of oil from two different locations, WTI being West Texas Intermediate, which refers to both the location and, I think, the style of the mixture of the oil, which gets into chemical details that I don't know. And also European Brent, which is a similar thing coming from the North Atlantic in Europe. And over time, sometimes WTI has been more expensive than Brent, sometimes it's the other way around. These prices, while they generally move together and loosely track the price of oil, do have their own dynamics going on. If there's heavy weather going on in the North Atlantic or something, I imagine the European Brent price might respond to that, whereas the West Texas price might not.
(07:09)
So we have this data set here with two different variables that are varying over time, and we're trying to see both that those variables move around together in a similar way to the training data, the real data that we have, but also that they look reasonable over time, that they don't have super sudden jumps. Although if you look over the entire 35 years, there have been some fairly quick changes in these prices based on various global situations, and certainly 2020, and I think this year, 2022, has probably had some large ones.
Speaker 1 (07:48):
So this little blip down here, this would've been that pandemic time whenever oil price went below zero and everyone was freaking out about it.
Speaker 2 (07:57):
Yes. So that's actually a real price; based on the trading dynamics, it was actually trading at below zero at that time. Is that right? I noticed that that happened in the data and that's actually a real thing that isn't just some artifact, something
Speaker 1 (08:11):
Yes. This was something
Speaker 2 (08:12):
You might clean out of your data.
Speaker 1 (08:14):
Yeah, there was that brief period of time when the demand for oil as everyone was staying in was so low that it did trade below zero for a couple of days, which is just absolutely bonkers to think about how does something trade at below zero prices, but it happened, so yeah. Okay, cool. So we have a really cool data set and this one is one of the, this is a notebook that you have written, so if it breaks, we're blaming you. It's not going to break by the way.
Speaker 2 (08:44):
And this notebook is linked
Speaker 1 (08:46):
In the blog. Yes,
Speaker 2 (08:47):
Linked for the blog post. If anyone wants to follow along,
Speaker 1 (08:50):
Where's it added? Usually? Yeah, right here. So it's in our Gretel blueprints repository. If you haven't checked out our Gretel blueprints repository, it's a great place to just check out and see a lot of different use cases for synthetic data, not just synthetic time series, but all the things you can do with Gretel: tabular data, text-based data, conditional generation, all sorts of things, relational databases. If you came to the last workshop, you were using some stuff out of there about generating relational databases. So it's a good resource for you to go ahead and bookmark or star in general, just because there's going to be a lot of different examples. Cool. So we've pip installed the Gretel client, pandas, matplotlib. So we're installing a lot of this because we're going to be showing you some graphs and stuff. Kendrick, if we were just generating synthetic data and we weren't doing all of our pretty graphs, would we need to install things like Torch and SciPy, or is this because we're doing all of our nice graphs?
Speaker 2 (09:45):
I think the ones that you would need if you're running DGAN through the Gretel cloud are just the Gretel client, so that you can interface with our cloud, and then pandas just to store data and easily pass it into us. But yeah, the others, matplotlib, SciPy and such, those are for the visualizations that we do here to give you a bit of a sense of what the data looks like and how well the model is doing at generating synthetic data.
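For anyone following along in the notebook, the install cell looks roughly like this. It's a minimal sketch; exact packages and versions may differ from the notebook itself.

```python
# Notebook install cell (sketch). gretel-client and pandas are the essentials for
# running DGAN in Gretel Cloud; matplotlib and scipy are only used for the local plots.
%pip install -U gretel-client pandas matplotlib scipy
```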
Speaker 1 (10:15):
Awesome. Okay, so we're going to do some imports. Those worked. It's always good when the imports work. Let's go to the dashboard and I'll pick up my API key. So we're going to go ahead and log in with our API key, don't cache it, validate it. So we're logged in as me, and now what we're going to do is download the data and create, well, it looks like a data frame with some merged columns. So what is the purpose of us merging the columns here?
Speaker 2 (10:44):
So the purpose of merging the columns here is that we're going to train the model on both columns at once, and that means it'll be able to generate synthetic data that has both the WTI and the Brent price in it. And it should be able to generate that these prices generally tend to move up together or move down together, but with some variations. And so to do that, we need to put these all into one data frame in what we term a long style format, which we may see in the config here, where you have the date or a time column, and then whatever variables you want to model are additional columns in that data frame.
Speaker 1 (11:29):
Fantastic. Cool. As you see here, we've downloaded it from this data hub, got the CSVs, it looks like that. And I assume that when you merge these, they need to have the same date key. Trying to merge stuff with an improper date key would probably lead to a very upset model.
Speaker 2 (11:47):
Yes, that's certainly true. And there's a lot of subtlety and nuance around how you might structure a time series task that maybe we can talk about later, or we can certainly answer questions about. But fortunately, because we're getting both of these sequences from the same source, they're already set up so that they have the same set of dates that they have a measurement for, from 1987 up through 2022. And that merge, the wti.merge(brent) there, just works and we don't have to worry about it right now.
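For reference, a minimal sketch of the merge into the "long" data frame being described here; the file and column names are assumptions rather than the notebook's exact ones.

```python
import pandas as pd

# Each CSV has a date column plus one price column; both come from the same
# source, so they share the same set of trading dates.
wti = pd.read_csv("wti-daily.csv")      # assumed columns: Date, WTI Price
brent = pd.read_csv("brent-daily.csv")  # assumed columns: Date, Brent Price

# One row per date, one column per variable: the "long" style the model expects.
df = wti.merge(brent, on="Date")
print(df.head())
```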
Speaker 1 (12:18):
Isn't it nice whenever the live demos are set up properly?
Speaker 2 (12:22):
Yes.
Speaker 1 (12:23):
Okay. So now we're going to plot the entire 35 year price history, which, if you recognize these plots, is exactly what we saw in the blog. So we're just doing some pandas plotting. What is plt? Is that from pandas? What library is that
Speaker 2 (12:36):
From? That's matplotlib
Speaker 1 (12:37):
Matplotlib. Okay.
Speaker 2 (12:39):
What's your favorite one? I think this notebook is using matplotlib. The figures from the blog I think are using Plotly, so they're a little prettier for the blog.
Speaker 1 (12:49):
Fantastic. So this is not synthetic data yet. This is just the data that we have. So we see that the data is, it follows general trends. It's not identical, but you could see that they have a general trend here, which
Speaker 2 (13:04):
Is right. And certainly at the 35 years that we're plotting there, the general shape is pretty similar.
Speaker 1 (13:11):
If this was a kindergarten assignment or an art assignment where I was making someone draw something that looked like what's on the board, I'd probably give them an A for this. It looks relatively similar. Okay, so now we're going to go ahead and plot the sequences. Okay, so this is something I know that we want to talk about, which is the max sequence length. So do you want to dive in a little bit here about plotting 10 day sequences?
Speaker 2 (13:30):
Right. So we have the whole 35 years of data here, and oftentimes with time series that's what you have: the historical record of some phenomenon, in this case oil price histories. But the DGAN model, being a neural network model, is trained with stochastic gradient descent. That's all part of what's happening underneath the hood, but it means it's going to want a lot of training examples to update the weights properly. So we need to figure out how to take this 35 years and turn it into many training examples. And the most common way here, the easiest way a lot of times, is to just split this into chunks of some length. So here we're going to use chunks of 10 days long, and there's a lot of nuance about what length you might want to use.
(14:28)
And a lot of this will depend on both the data that you have and what sort of downstream task you might want to do. So here we use 10 days because that roughly corresponds to two weeks of trading data. We're not worrying about holidays when there wasn't a trading price and all that for this analysis; if you're doing something more sophisticated with prices, you certainly might want to work on that. So yeah, basically we take that 35 years and chunk it into a bunch of 10 day long chunks. We're going to plot a couple of those here just to give you a sense; this is what the model's actually seeing as it's training, in some sense, and this is the shape of the synthetic data that the model will produce as well.
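A rough sketch of the chunking idea, just to make the splitting concrete. The Gretel API does this automatically when a max sequence length is set and no example ID column is given, so this is illustration rather than the notebook's code.

```python
max_sequence_len = 10  # ~two weeks of trading days

# Split the 35-year history into consecutive, non-overlapping 10-day examples.
chunks = [
    df.iloc[i : i + max_sequence_len]
    for i in range(0, len(df) - max_sequence_len + 1, max_sequence_len)
]
print(f"{len(chunks)} training examples of {max_sequence_len} days each")
```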
Speaker 1 (15:10):
And then we have these plot subsequences, so I'm assuming this is the 20th sequence of 10, the 500th sequence of 10, the 731st sequence of 10. This is just us sampling a little bit to show you. As we can see here, and again this is still real data, these look almost identical, but they're not quite identical. This one looks like it peaks, goes down a little bit more, maybe a little bit delayed, but it's there. It's close, but it's not identical, which I think is definitely worth pointing out.
Speaker 2 (15:43):
And as you can see on the dates, maybe a little hard to see in the presentation, but in the 2007 sample, that's the second one there, the Brent price is higher than the WTI price, while the earlier sample from the 1980s had the WTI price being higher than Brent. So those switched at some point, and they may change back and forth a little bit depending on whatever's going on in the general oil world, all those details that I don't know about.
Speaker 1 (16:16):
Yeah, actually I hadn't even noticed that. That's a really cool thing and if you're having a problem reading it, just turn your head to the side. It works there. You can just do
Speaker 2 (16:24):
That.
Speaker 1 (16:26):
Okay, so the next thing we're going to do is just train the model. So we're going to basically create a temporary CSV file; this is how we upload the data to the Gretel API. We're going to create a project inside of the Gretel API, which we will be able to see here in our console. We're just going to call it DGAN oil, create=True. If it's not there, create it; if it was already there, we just don't have to worry about it. Then we are going to basically poll the model as it's training, and we're going to change a couple of the configurations, as it says here. So the config: we're going to read the model config for time series, we're going to give it a name, and then we're going to say generate. I'm guessing this just says, hey, generate 10,000 records.
Speaker 2 (17:07):
Yeah, that's correct. One minor note: we're saying we want to generate 10,000 records here, but for the DGAN model, what that means is we want to generate 10,000 sequences.
Speaker 1 (17:20):
Okay, so it would be a hundred thousand records.
Speaker 2 (17:22):
So we're actually going to have a hundred thousand rows in the generated synthetic data frame that we get back. And the first 10 rows correspond to the first sequence that was created.
Speaker 1 (17:36):
Okay, cool. I did not know that. That is a good thing to know
Speaker 2 (17:38):
And there are some places where you see different numbers based on that, which might be confusing, but that's what's happening: sometimes we're talking about the number of rows and sometimes we're talking about the number of examples. So yeah, just be aware of that.
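For reference, the training cell being discussed looks roughly like this. It's a sketch based on the Gretel client API as used around the time of this recording; the config name and the exact nesting of the timeseries_dgan section are assumptions that may have changed since.

```python
from gretel_client import configure_session
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project
from gretel_client.projects.models import read_model_config

# Prompt for the API key, don't cache it, validate it.
configure_session(api_key="prompt", cache="no", validate=True)

project = create_or_get_unique_project(name="DGAN-oil")

# Start from the time series template and ask for 10,000 sequences
# (at 10 rows per sequence, that is ~100,000 output rows).
config = read_model_config("synthetics/time-series")
config["name"] = "dgan-oil-data"
config["models"][0]["timeseries_dgan"]["generate"] = {"num_records": 10_000}

# Upload the merged data frame as a CSV and train in Gretel Cloud.
df.to_csv("oil.csv", index=False)
model = project.create_model_obj(model_config=config, data_source="oil.csv")
model.submit_cloud()
poll(model)  # prints status updates until training finishes
```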
Speaker 1 (17:55):
I realize I should have hit start before we started talking about it True,
Speaker 2 (17:58):
Because this is going to take a little while.
Speaker 1 (17:59):
This is going to take a little time. We should have just gone through it and had it going. So then we're going to create the model object, pass it the model config, write the data frame to CSV, basically store it as a CSV file with index equals false, set the model's data source to the temp file, and then submit it and poll it. So as we're going to see here, this is the full configuration. We didn't show you this; we try to use reasonable defaults for all of our configuration. That's one of the things that we do here at Gretel: we try to make it so that, to get started, you don't need to tweak every hyperparameter and such. Now you can do that, we have great documentation that shows how to do that, but we try really hard to make it where we get it right for you the first time, and then if you want to refine it, you can.
(18:49)
But that's one of the things that I think we're really proud about, that we work on here, is to try to make it where every synthetic generation works and you get reasonable results out of it. So yeah, now we see it training, we're going through all of this and it is training, and this is going to take a little bit of time. Okay. Kendrick, earlier you had mentioned that there were a couple of nuances around setting up synthetic time series data and such. Do you want to go into some of those while we're waiting on our model?
Speaker 2 (19:25):
Yeah, we can talk a little bit about that. So I guess maybe the top level one is just the format: you're uploading a data frame, but we need to interpret it as this time series, or really as a number of different time series. And the format, actually, if you go look at the config that we printed out there in the output, if we can look at the, oh, it'll be at the bottom after the params. So yeah, loosely, the params are all model specific. There's the attribute columns, feature columns, right? Six.
Speaker 1 (20:06):
Scrolling in notebooks is hard,
Speaker 2 (20:09):
Right? There's too many scroll bars, multiple scroll bars is challenging. So there's the six columns. The generate one is about the synthetic data, so that one's not really part of it, but these are all parameters so that you can tell the Gretel API how to interpret your data properly as a time series and potentially how to break it up if you need to do the chunking. And there's a number of different ways to do this, which we have documented in the documentation, but I guess I'll hit a couple of the highlights. One is this df_style. So this is what we're terming a style or format for the data frame. This long format, which is what we used here, has every time point as one row in the data frame. So there's the date or a timestamp of some sort, and then the various variables that you measured at that point.
(21:11)
There's an alternative style that you can use, which is a wide format, where each row is the entire example that you have and the different time points are each a column. I don't actually remember which format we're using for our other example; it should be the same, I think it's the same. We don't have an example for that, but there's some flexibility there about what format you might have and how to set it up. And maybe the other config parameter here we could talk about a little bit is this example ID column. So here we use null for that, we're not passing that in, and that means that internally we're going to automatically split your data into these 10 day long chunks. And that's based on the max sequence length, which is up further in the config, but we set that to 10.
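Purely illustrative, here is the same two days of prices arranged in the two df_style layouts being described; the column names are assumptions, not requirements of the API.

```python
import pandas as pd

# "long" style: one row per time point, one column per variable.
long_df = pd.DataFrame({
    "Date": ["2022-01-03", "2022-01-04"],
    "WTI Price": [76.1, 77.0],
    "Brent Price": [79.0, 80.2],
})

# "wide" style: one row per training example, one column per time point.
wide_df = pd.DataFrame(
    [[76.1, 77.0], [79.0, 80.2]],
    index=["WTI Price", "Brent Price"],
    columns=["2022-01-03", "2022-01-04"],
)
```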
(22:02)
But if you have some other data or you're trying to do a different type of task here, maybe you're doing some financial modeling with prices of some assets or stocks, maybe from the New York Stock Exchange, you might set it up similar to this, where you want to generate many different stocks at the same time and be able to see, in your generated data, correlations and changes, so that, oh, tech stocks tend to all go up together or go down together, maybe, but stocks that are related to, I dunno, maybe energy aren't too correlated with tech stocks but are correlated together within that energy sector. But you could also set it up, maybe just so you can have enough training data, and another big trade-off here is needing enough training data, where you just want to generate one stock at a time. You just want the model to produce an example of maybe 30 days of price history for a random stock from the New York Stock Exchange. And so there you could set the example ID column as the ticker symbol, so that each training example that the model sees is a different stock. So it sees what did Apple stock do for a 30 day time period, or what did Google stock do for the 30 days, or what did Visa stock do for the 30 days?
(23:31)
This is all a little bit amorphous talking about it generally here, and I think probably an example would be a lot easier. So maybe we can go into more of this in a future workshop or something, but also happy to answer any questions about this, but there's a lot of flexibility and power in these parameters and we're certainly interested to hear your feedback as this model is definitely an open beta. If there are other sort of styles of data or format sort of things here that would be helpful for your own usage, we'd love to hear from you about those.
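To make that stock setup a bit more concrete in the meantime, here is a hypothetical sketch of the config changes being described; the column names and the exact placement of these keys within the timeseries_dgan section are assumptions.

```python
# Assumed input: one "long" data frame with a row per (ticker, date) pair.
dgan = config["models"][0]["timeseries_dgan"]
dgan["df_style"] = "long"
dgan["time_column"] = "date"
dgan["example_id_column"] = "ticker"      # each ticker becomes one training example
dgan["feature_columns"] = ["close_price"]
dgan["params"]["max_sequence_len"] = 30   # ~30 days of history per stock
```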
Speaker 1 (24:02):
Definitely. Yeah, let us know in the chat if you've got any questions or any feedback. No, that's really cool. There's so much you can do with this. I was reading through the documentation for it and I was like, oh, this is really cool. I wonder how I can use this. I guess a quick question though is are we limited right now to only synthetic time series data for numeric data or can we do textual based times? That's a big combination, textual based time series data. The example that I always go back to is if I took logs off of a server like access logs, could I generate synthetic access logs?
Speaker 2 (24:41):
So as often the answer is sort of yes and no. Depends
Speaker 1 (24:49):
On how hard you want it. Yeah.
Speaker 2 (24:51):
So yes, in that we do support categorical variables. So if the various text parts of your log messages are fairly structured so that you see
(25:05)
Debug, info, warning for the type of the log, and then maybe the actual messages you see come from 20 or 30 different types of messages, but with maybe some additional numeric columns of what time this happened or what process ID it came from, that sort of thing, right? Yeah. Then that's definitely something that this model can handle, that we can handle discrete or categorical variables. The one note about that, the caveat right now, is that you need to have already converted that variable into an ordinal encoding, so 0, 1, 2, 3, 4, 5, up to K minus 1 for K unique values. If you just try and pass a data frame with a text column into it, the model will crash, which I think happened the first time Mason tried this for this process. And so that's also a future improvement that's on our list, to handle that automatically and give better error messages and stuff.
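A minimal sketch of the manual ordinal encoding being described, using pandas; keep the mapping around so you can decode the synthetic output afterwards.

```python
import pandas as pd

logs = pd.DataFrame({"level": ["info", "debug", "warning", "info"]})

# Map each unique value to an integer 0..K-1 (the model expects numeric input).
codes, categories = pd.factorize(logs["level"])
logs["level"] = codes
decode = dict(enumerate(categories))  # e.g. {0: 'info', 1: 'debug', 2: 'warning'}
```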
Speaker 1 (26:09):
That is one of the things I do like to do whenever I do these workshops, especially when I have someone on: we use the baseline model, and then I go find, excuse me, then I go find my own dataset and I'm like, okay, how easy is it for me, with minimal knowledge, to be able to swap this out and get it working? And actually, I think Kendrick and I both found it was relatively easy; it was really easy for me to get another dataset working. The model learned the patterns. The one that we're going to show you later is a site visitors dataset, and earlier Kendrick had said, when we originally were doing the DGAN stuff on the Wikipedia traffic, there are trends in when webpages are visited and stuff, and we saw that come through in the data, which was really nice. Okay. We are almost, well, I can't say we're almost at the end.
(26:54)
Every time I've trained this, it's been in the 10 minute range, but oh, oh, nope. There it is. We're good. We got it. We made it. I knew that if we just rambled on long enough, we'd get there. As someone who used to, I say this joke every time, and for those of you that tune in every time, you're going to have to hear this joke for the next three to four years until I get tired of it. I worked in DevOps for a while and I did a lot of builds, and whenever operating systems are building or package or building, sometimes it takes a while. So you build it and then you're just kind of sitting there waiting on it, and it feels like model training is very much like that as well. It's like, oh, what are you doing? Why are you on your phone? Well, it's building. We're waiting on it to build, and we're waiting on it to train, but looks like we're good here. Took about nine minutes on a pretty relatively large dataset, so that's pretty impressive. And now we can go ahead and we can read the first 10,000 synthetic examples. So we have date WTI price, Brent Price.
(27:49)
Okay. So we've got some dates.
Speaker 2 (27:51):
Also good to mention that we have this example ID column here. Yes. So that's this new column that's added to the outputs, and that's telling you, again, that the DGAN model is generating 10 day sequences at a time. So all the rows with example ID of zero, and there should be 10 of them. And I think maybe the next cell might show, if you just scroll down to the next cell and run that, that should just show us the first few. So I think
Speaker 1 (28:25):
Yeah, we can also click on the little magic fancy, the little magic button and it works, and then we can do all the pages. Nice. But yeah. Okay. So you're right. So yeah, when we do it from the zero to 20, we see 10 with zero and 10 with one. So these are all one chunk.
Speaker 2 (28:41):
When you're doing analysis and doing whatever downstream tasks you might have, you should be looking at each of these 10 day chunks. You shouldn't really combine them back together into one long thing, because there's not going to be any correlation between that ninth row and the 10th row. The 10th row is sort of a new version of history, in some sense.
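In practice that means iterating over the output one example at a time, roughly like this; synthetic_df here stands for the data frame read back from the model.

```python
# Each example_id groups one independent 10-day sequence; treat them separately
# rather than concatenating them into a single continuous history.
for example_id, sequence in synthetic_df.groupby("example_id"):
    assert len(sequence) == 10
    # ...analyze or plot this 10-day sequence on its own...
```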
Speaker 1 (29:03):
And it's restarting; we can see how this goes from 51, 54 to like 17, 14. Correct. Okay, cool. We actually had a comment here. Cameron Garrison says, one thing that would be nice would be to have a verbose / non-verbose training flag; 10,000 print lines is too many. It is a lot. Yes. But I think we can, yeah. Okay, I see what you're saying. Yeah, we can take that feedback back and look into it. So we've got that, and now we have a whole bunch of helper functions, which is basically just a whole bunch of plotting math. I don't know how much it is worth going into. This is just plotting math, right? Yeah, I don't
Speaker 2 (29:43):
Think we need to go into it. This is both some fancy plotting, but also trying to get some nice colors and things in it so that the plots look pretty.
Speaker 1 (29:52):
Okay. Yes. So if you're used to doing stuff with the LSTM model that we have, you'll know our synthetic data quality score report. We don't currently have that for DGAN; it's coming. So for right now, what we're doing is using this code to basically create that report for you inside the notebook, which I could definitely see some people liking. Okay, so here we are with our, this is the correlation, the, what is this one called? Correlations.
Speaker 2 (30:21):
Oh, right.
Speaker 1 (30:23):
What is it called? It's in the, I always forget the names.
Speaker 2 (30:26):
Is it just called field correlation?
Speaker 1 (30:28):
Yes. Yes. I think it is the field correlations.
Speaker 2 (30:30):
Yeah. Yeah. So in the synthetic quality score report that we would use for tabular data, we have the field correlations, which tells you how these variables are correlated and then looks at the difference between those correlations in the synthetic data and the training data. And so this heat map should look very similar to that. And the good news here is that all the values are very near zero, so the correlations between the WTI and the Brent price, both in the training data and the synthetic data, are very similar; the difference is very close to zero.
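The underlying comparison is simple enough to sketch; synthetic_df is the synthetic output from earlier, and the column names are assumptions about the merged data frame.

```python
cols = ["WTI Price", "Brent Price"]

train_corr = df[cols].corr()
synth_corr = synthetic_df[cols].corr()

# Values near zero mean the between-variable correlations were preserved.
print((train_corr - synth_corr).round(3))
```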
Speaker 1 (31:14):
Yes, you've definitely seen this before if you've used our other one. It's usually a much bigger one, but because we only have two things we're looking at, it's really small. It'll be a little bit bigger in the other one that we do, which I think has three. Okay. So now we're going to go ahead and plot a histogram of the distribution, so the WTI price and the Brent price. And these are just the distributions of values, so what the training versus synthetic data look like. And as you can see, you want them as close together as possible, kind of, sort of. It's okay that this one's a little bit higher, because if it was exactly the same, then it would be spitting out the exact same data and we wouldn't actually have generated synthetic data, which would be against the point of what we're trying to do here. See, I'm learning. Yeah, so we have that right there. That's a boring map. I like this one better, I think. Oh, no, no anticipation. That was the prep one. The one that I liked the most is, oh, no, yeah, it's this one: the autocorrelation. Yes. So the real versus the generated. As we can see, they are a little bit different, but they have the same trend. So if you saw this, Kendrick, what would you say about it? Is this good quality? Are you happy with this?
Speaker 2 (32:36):
Yeah, I think this is sort of okay-ish quality. I think one of the challenges that we have had with the DGAN model is that it can be a little bit unstable, and so rerunning the same model with the same config parameters may get us a somewhat different result here, so that the generated data has a little bit higher autocorrelation at the low lags. And as a quick explanation of what this plot is: autocorrelation here means we're looking at the correlation between, say, the Brent price at day n compared to day n plus 1. And that would be, I think, the zero on the x-axis here. So when you're just looking at adjacent time points, the correlation's pretty high. That's what you'd expect with prices; you don't expect prices to change too suddenly.
(33:36)
But we do see that the real data has that; the correlation there, the Pearson correlation, is 0.5, 0.6 or so, whereas in the generated data it's a little bit lower, there's not quite as much tendency for the price on two adjacent days to be near each other. And then the rest of the x-axis here is showing how that correlation changes as you look at a longer and longer time lag between two data points. So if you look at data that's a week apart, five days apart would be from one Monday to another Monday, that would correspond to an x value here of a time lag of, I think, four, the way that the labels are here, and we see that there's a lot lower correlation there, as you'd expect, since prices tend to move sort of as a random walk, maybe; there's a lot of technicalities in there, as we've recently learned. But yeah, this is one view of looking at how well the model captured the temporal aspects of my data: did it actually capture how these variables are changing over time? Whereas the other heat map we were looking at was more just looking at whether these variables at any given time point have the right relationships between them.
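A simplified sketch of what this plot measures, not the notebook's actual helper code: the Pearson correlation of each series with itself at increasing lags, computed within each 10-day example and then averaged. It reuses the chunks list and synthetic_df from the earlier sketches.

```python
import numpy as np

def lagged_autocorr(series, lag):
    # pandas computes the Pearson correlation between the series and its lagged copy.
    return series.autocorr(lag=lag)

lags = range(1, 7)  # with 10-day sequences, larger lags leave too few pairs

real_acf = [
    np.nanmean([lagged_autocorr(chunk["Brent Price"], lag) for chunk in chunks])
    for lag in lags
]
synth_acf = [
    np.nanmean([lagged_autocorr(seq["Brent Price"], lag)
                for _, seq in synthetic_df.groupby("example_id")])
    for lag in lags
]
```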
Speaker 1 (34:49):
Yeah. Okay. I mean, that's pretty close. I mean, it's not perfect, but that's close enough. It didn't go the other direction.
Speaker 2 (34:57):
Yes, that's true. That's
Speaker 1 (34:58):
True. Yes. So close enough, like you said, rerun it a couple of times. So let's look at a couple of just random sequences and okay, this one looks pretty good here. Oh, this is WTI versus Brent Price, this one.
Speaker 2 (35:12):
Yeah, the middle one is maybe a little bit more suspect. And this sort of happens, and again, there's sort of potentially different parameters you could tune or tweak to improve this on your particular data set. And depending on what your downstream task is, ultimately what you want to do with the data.
Speaker 1 (35:32):
Yeah, makes sense. This one's pretty close, I would say there's a pretty deep, yeah,
Speaker 2 (35:36):
That looks pretty good. If we look at that as compared to the three samples that we looked at earlier from the training data, as a human, maybe we couldn't really tell the difference. The one that has a couple of different changes between which one's higher or lower in the same 10 days is maybe a little more implausible.
Speaker 1 (35:56):
Yeah. Okay, cool. So that is the oil one. And then what I do, like I said earlier, and I probably should have started running these a little bit earlier, is I always redo it myself. Basically take the notebook and see how quickly I can adapt it to run with my own data. And by my own data, I mean something that looks interesting on Kaggle. I just go and find something that looks cool, and I went with a site visitors dataset, which is pretty fun. As you can see, most of this code is almost identical. I guess one of the benefits of using Gretel and stuff is that the SDK is very similar; actually, the part where we actually train and submit the model is almost identical across all of our models. It's mostly just config changes. So you have a lot of similarities there with the code that you're writing.
(36:44)
Let's go ahead and grab our API key again one more time, and I can't wait until we get to start playing with trainer a little bit more, which is, if you've watched our blogs, we have a much simpler version of our SDK that is in development right now that takes these 30, 40 lines and turns it into two lines and it's awesome. But yeah, so what we ended up doing is we have a dataset that is traffic to a website. I don't even know what website it is. And some of the stuff that we're going to do here that you'll see that I ran into, which is what Kendrick was talking about, changing it into an ordinal. Was that correct? Is that the correct term?
Speaker 2 (37:23):
So like label encoded or ordinal encoded,
Speaker 1 (37:25):
Yes. Ordinal, yes, yes. Yeah. So one of the things I had to do, as we saw with the text data, was also deal with people who put commas in their numbers in statistical data sets. Please stop. One, I'm not a pandas expert, so when Kendrick was like, oh yeah, you could just add thousands=',', I was like, well, that would've been great. You should have seen the code that I had here to fix that. And then we convert the date into this year-month-day format. This dataset also has a couple of other random columns in it that we just didn't need. So I don't need the day, I don't need the row, I don't need the day of the week. These are cool columns, but we didn't really need them. And then we did a convert date, which was basically just running my function on it.
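A hedged sketch of the clean-up steps just described; the file name and column names are guesses at the Kaggle dataset's schema, not something to rely on.

```python
import pandas as pd

visitors = pd.read_csv(
    "daily-website-visitors.csv",
    thousands=",",               # parses values like "2,345" as numbers directly
)

# Drop columns the model doesn't need and normalize the date format.
visitors = visitors.drop(columns=["Row", "Day", "Day.Of.Week"])
visitors["Date"] = pd.to_datetime(visitors["Date"]).dt.strftime("%Y-%m-%d")
```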
(38:13)
And here we go: this is the page loads, unique visits, first time visits and returning visits. So we have a lot more columns in this dataset. I don't even remember if I trained on all of them, but it's a much smaller dataset, so it should go quicker. Looks like I did. I did unique visits, first time visits and returning visits, and I didn't do page loads, or that's not what I decided to plot. And then we plotted it, and as Kendrick had said earlier, we see there are trends in when people go to websites and stuff. So we see these really thick line trends because there's so much data here, only five years' worth of data, yet it looks all,
Speaker 2 (38:53):
But there's definitely yearly, some yearly trends as well. Yes, definitely that lower traffic around the new year and such.
Speaker 1 (39:01):
Yeah, everyone's having fun. And then we're going to go ahead and we're going to use the same sequence length of 10. We didn't change that, and we're going to just kind of track unique visits, first time visits and returning visits. So this is our training data looks pretty cool.
Speaker 2 (39:18):
So I think one thing to note here is that there are definitely weekly patterns in this. I don't know what 2015-4-4 means in terms of the date, but April 4th? Is that April 4th? I expect if we went back and looked at the calendar, that would be a Saturday or Sunday, that this website has lower traffic on Saturdays and Sundays and higher traffic during the week. And so we will see if the synthetic data is able to replicate that
Speaker 1 (39:50):
April 4th. I'm going to check it, see what it was a Saturday. Good. Wow. Okay. Yeah, good guess there. It makes sense. You've seen all this data, you know what the trends would be. But yeah, that was a Saturday. I could not pull that out of the top of my head. I don't remember where I was April 4th, 2015.
Speaker 2 (40:08):
Right, right. So yeah, with a max sequence length of 10 here, we should be able to pick up weekly trends, weekly patterns that are happening in the data, but we won't be able to pick up a yearly pattern, the fact that 365 days apart you tend to have a little bit higher correlation than looking at 180 days or something. But that's just because of the max sequence length of 10 that we're choosing here, and we're choosing that max sequence length of 10, as opposed to something larger, partly because we want to have enough training data, enough examples for the model to learn. We could certainly try it with a max sequence length of 30 or 365 or something, but then, with the data that's available in this data set, the model may have a hard time learning. There'd be sort of too few examples
Speaker 1 (40:57):
At that point. Yeah, if we did 365, and this data goes from 2015 to 2020, maybe 2021, we would have very few examples, if any, and that would probably be really difficult. So that brings up a good question though. Is there a kind of max limit to sequence length? With GPT, I know that if I use more tokens, it uses way more memory resources. So with DGAN, does it do the same thing? Is it a direct correlation between sequence length and memory resources?
Speaker 2 (41:32):
It's not a direct correlation. So partly there are these two parameters: the max sequence length, which is the length of the sequences we're training on, and also this parameter called sample length, which is sort of a trade-off between having a much bigger model with lots more parameters, but maybe slower to train, taking longer and more data to train, versus simplifying the model. You can loosely think about it as sharing weights in some way, and there's more documentation about that. But in general, let's see if I get this right here, I think you want max sequence length divided by sample length, and that needs to be an integer. That's a minor thing, and I think it should throw a pretty clear error if it's not. You want that ratio to be in the sort of five to 20 range; that will probably give you the best results.
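The constraint is easy to state in code; this is a small illustration, not an API call.

```python
max_sequence_len = 10
sample_len = 2

# max_sequence_len must be an integer multiple of sample_len...
assert max_sequence_len % sample_len == 0
# ...and their ratio is roughly what you tune; ~5-20 tends to work best.
print("ratio:", max_sequence_len // sample_len)
```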
(42:32)
But in terms of how long of a sequence this model can handle: again, that Wikipedia traffic data, which is also website visitors, used a max sequence length of 550, and that's quite, quite long, and the model does an incredible job at picking up the weekly correlation of that data throughout the entire length of all 550 days that it's looking at. Again, if you go look at the open source work or the first blog that we did about this, talking about the PyTorch implementation that we open sourced, there are some of these autocorrelation figures, and there are more details there about that data and about how the model works.
Speaker 1 (43:22):
Is there a certain image here you would like to look at?
Speaker 2 (43:25):
Yeah, I think those three images are the next one down from that. This one. That's the same sort of autocorrelation plot that we showed earlier, but the x-axis goes from one to 549, basically. And so the jitter sort of thing that you see on there is the weekly pattern, and then there's the yearly pattern, which some of these models picked up. The left two models picked up, in the synthetic data, which is the orange line here, that at the year mark, 365, there's that higher correlation.
Speaker 1 (44:01):
Yeah, it seems like, so TF1, I assume, is the original TensorFlow implementation?
Speaker 2 (44:06):
Yes, that's the original implementation that is up on the repository from the original authors. And there were maybe some things with some slightly different parameters that helped make the PyTorch implementation a little bit more stable. But again, that's something where, if you run that TensorFlow implementation with the same data and same config five times, three or four out of five times it'll look like the left two plots, it'll be great, and one out of five times the model sort of gets a little lost during the training and doesn't do as well.
Speaker 1 (44:37):
Yeah, it kind of looks like it wandered off the path here at the very end. This was the bell rang, I'm leaving class, it's five o'clock, I'm going home, I'm not doing this anymore, which I admire in a computer. Set your boundaries, know your limits. Okay. So this one, like we said, generates a lot quicker, which is really nice. So now we can go ahead and read the synthetic data that we have here. So we have it, and again, it looks very similar to our old one. It has all of the different columns. Looks like I did train on page loads. Well, yeah, I submitted it, of course it trained on it. So page loads, unique visits, all that. It has the IDs, which will be groups of 10 again, right? Yes. Okay. Is there anything you want to point out here or should we move forward?
Speaker 2 (45:23):
Yeah, I guess the one thing maybe to point out, which you'd also asked about previously when we were chatting about this, is that we can see that we have floating points for page loads and unique visits and stuff. So even though the input data was integers, the model doesn't round that and produce integer outputs, it just produces floats. So here, at this point, we'd recommend just doing some post-processing, using a NumPy int or an appropriate conversion, if you need things to be integers.
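A minimal post-processing sketch for the count-like columns; the column names are assumed from this visitors dataset, synthetic_visitors stands for the synthetic output, and whether to round or truncate is discussed next.

```python
count_cols = ["Page.Loads", "Unique.Visits", "First.Time.Visits", "Returning.Visits"]

# Convert the model's float outputs back to integers.
synthetic_visitors[count_cols] = synthetic_visitors[count_cols].round().astype(int)
```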
Speaker 1 (45:54):
Would you round or would you truncate? What do you think? Does it make a difference?
Speaker 2 (46:00):
I don't think it's going to make a huge difference on this. I would probably just round it, but yeah, I don't think it would make a,
Speaker 1 (46:08):
Yeah, I didn't think so, but sometimes I'm wrong. It has been known to happen. So we have our chunks again, chunks of 10. We're going to run the little helper functions, and now the moment of truth, our correlations. Okay, so like we said, we have a lot more things here, and while everything in the last one was super yellow and basically perfect, this one got a little bit closer into green, which, I feel like if I came on here and everything was perfect, you'd kind of think I'm lying to you. But this is still well within the realm of acceptable; as we can see, it's just a little bit greener. The unique visits versus returning visits seems to be, yeah, the highest. Oh, those are the same thing, just different axes on the chart. Cool. Let's go ahead and go here and look at our histograms. Interesting.
Speaker 2 (47:12):
And so here is it the green is
Speaker 1 (47:16):
Synthetic.
Speaker 2 (47:17):
Synthetic, okay. Yeah. So the model's not producing quite as peaked of a distribution here.
Speaker 1 (47:26):
Let's give it a lookie.
Speaker 2 (47:27):
That might be something that training it for a little bit longer, a few more epochs perhaps could help with that.
Speaker 1 (47:34):
Okay. But the auto correlation, okay, so it kind of bumps here, but it looks pretty good,
Speaker 2 (47:41):
Right? So here in this dataset, the auto correlation's a little bit more interesting because we see that
(47:50)
The seven-day autocorrelation, which actually corresponds to x equals six on the x-axis, we didn't get the labels quite right on that, my bad, we see that the correlation actually goes back up, in both the real data and the synthetic data, and the model is able to capture that somewhat. It maybe doesn't go up quite as much as in the real data, but it does recognize that, oh, even though the numbers move around a lot one and two days apart, once you get to six or seven days away, you're back to the same day of the week again, and it should look more like it did a week ago. And so the model is picking up on that somewhat.
Speaker 1 (48:30):
And now we look at our sequences. So that looks acceptable. That looks pretty good. I think that looks pretty good. It is a little bit weird for the unique visits to cross first time visits, just as someone who knows how web traffic works. That's weird. But I mean, hey, this looks, yeah, I'm happy with that. Now the question is, do you want me to run it again? This is a short training model. We've got time to run it again and just see what we see.
Speaker 2 (49:06):
Sure. Yeah, let's run it again and see how it looks, and we can maybe answer the, if we have any questions.
Speaker 1 (49:15):
Yes, if we have any questions, please feel free to drop them in the chat. If not, you have to hear Kendrick and I fill space.
Speaker 2 (49:23):
We have a question.
Speaker 1 (49:24):
Oh, we do, you're
Speaker 2 (49:25):
Right. We do have a question. Yes,
Speaker 1 (49:27):
We have a question that we had earlier that was pre-submitted. So how much does the quality of the input data from which the synthetic data is produced from affect the quality of the synthetic data?
Speaker 2 (49:40):
Yes. So I think this is a really good question, and in many ways it gets at the core of synthetic data and what it means and what good synthetic data means. But in general, for a model like this, it's very much what you put in is what you're going to get out of it. The model is trying to fit the training data, the input data that you gave it, and so the same sort of noise or cleanliness of that data would be reflected in the synthetic data that you get out. And perhaps one of the most challenging places where that may come up is if you have very few examples, and you just don't have as much variance, as many examples of unusual behavior. If you're thinking about, I dunno, website visits here, since this is the dataset we're looking at, right? If sometimes your website goes down for a few days, now you have zero visits. But that shouldn't happen that often.
(50:44)
Hopefully, hopefully depending on what kind of website you're running, what kind of hosting you're doing. And so if that happens, how often that happens, if that happens 1% of the time in your training data, you'll probably see that happen about 1% of the time in your synthetic data. And now depending on what you're trying to do with this, that may be good or that may be bad. And this gets back to the downstream task that you're using your synthetic data for matters a lot for different choices you might make. And this is an area that I think we're sort of actively exploring. If there's places where the synthetic data you want to create, you're trying to amplify, get more examples in a task where you don't have many examples, but maybe you want to change the distribution of those a little bit. Maybe you want those downtimes to be more common in your synthetic data because you're trying to detect them with some algorithm so you could get a nice alert or something perhaps. And so we're exploring some ideas around how to draw the data that you want to see, draw a distribution that you'd like to see in your synthetic data in some aspect of it, and then have the model generate to that sort of visualized version of it. But that's an idea that we've been thinking about and I think maybe, although I won't promise anything here, Mason might talk about it on a future workshop here.
Speaker 1 (52:11):
Yes, it's definitely one of the really cool ideas that we've been throwing around, and we definitely want to talk about it in a workshop. When is the question. It could be as early as next month, or it could be, who knows? For those of you that do watch pretty regularly, we are going to move to monthly workshops. Just timing; trying to do two a month is exhausting. You should try it. And there's just other things that I'm working on right now, but for now, we're going to go back to once a month. I guess the question I have while we're still waiting on this is, so you had mentioned that your website shouldn't go down that often, which is good. Oh, that just finished. Do I finish my question or do we, oh, finish the question. It wasn't supposed to do that. It was supposed to keep going. Bad model. But what if I wanted to simulate more of that instance? Say I have an event that happens very rarely in my time series data, but I actually need more of those instances so I can build either training models around it, or even tests around what do I do whenever this happens, and I can play that data back. I know that we support conditional data generation and augmentation with GPT and stuff. Are we there with DGAN at the moment?
Speaker 2 (53:24):
Yeah, so that's a great point. So no, conditional data generation is not something that's currently supported with DGAN, but aspects of it are something that we can definitely do. And just to give a little taste of that: one of the really great things about DoppelGANger, the underlying model setup, is that in addition to these variables that change over time that we've seen examples of today, you can also have fixed variables that just describe a particular sequence. So you could definitely set up your model such that you have, maybe, 10 day sequences from your website data, and then you add this additional attribute that labels each 10 day example with whether the website went down during that time or not. And then that's an attribute; you can encode that in the model, you can give it in the attribute columns and all that. And then in the future, we should be able to do conditional generation, so that you could seed it and say, okay, I want 50% of my 10 day sequences where the website went down and 50% where it didn't go down, even though that was not the proportion in the training data. So that's definitely a potential thing that we'd like to support and is on the roadmap.
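A hypothetical sketch of that attribute setup; the column names and config keys here are assumptions, and the conditional-generation step itself is described as roadmap rather than something available today.

```python
# Label each 10-day example with a fixed, per-sequence attribute: did the site
# ever show zero visits (i.e. go down) during that window?
visitors["example_id"] = visitors.index // 10
visitors["had_downtime"] = (
    visitors.groupby("example_id")["Unique.Visits"].transform("min").eq(0).astype(int)
)

dgan = config["models"][0]["timeseries_dgan"]
dgan["example_id_column"] = "example_id"
dgan["attribute_columns"] = ["had_downtime"]
```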
Speaker 1 (54:52):
Yeah, 50% downtime reminds me of my first website. So when I was in college, I decided I wanted to learn how to do ops stuff. So I paid for business class internet as a broke college student, got static IPs, and hosted my own DNS servers and website off of a single machine running virtual machines. And my roommate was notorious for tripping over the cable and unplugging my website. That probably says a little bit more about me than anyone wants to know, but people are always like, why don't you have a home lab anymore? I was like, because I used to have a roommate that would unplug the website. That's why I don't have a home lab anymore, even though I do want to rebuild one, especially with all the cool Raspberry Pi stuff. But anyway, we're done training. We've got our data. I've already downloaded it. Let's go ahead and look at these graphs and see what we have. The correlations. Okay,
Speaker 2 (55:45):
This one maybe didn't go so well.
Speaker 1 (55:47):
Maybe this was the reverse. Maybe this one went the other way. Let's see what we get. Oof. This one, it's okay, but the first time visits one actually looks really close, while this unique visits one definitely kind of gave up at the end. Okay, so we're demonstrating that it's still an open beta and that it's a little bit cranky. That's again, that correlation, which doesn't look bad, but I'm just going to say that looks not terrible. Like,
(56:20)
Okay, this is where it went. Okay, so in here, this is where it just went like, boom. This looks pretty good. Okay, we're running out of time, but I have more questions, which is really fun and also irritating at the same time. Say you generate all of this synthetic data, we've generated 10,000 sequences, a hundred thousand records, and is there a way you can look at it and be like, okay, we know this first one, sequence five, is kind of junk, but sequences, what is this, 5,000 and 9,121? These are actually pretty good. They're not perfect. How would you go about parting it out? Can we use just different chunks of it? And then, is that even a good question?
Speaker 2 (57:07):
Yeah, I think that's definitely a reasonable thing, but again, I think you need to have some idea of what is good, what's valid, and what's bad or not. So maybe in this case, any sequence where the number of, let me see if I get this right here, if the number of first time visitors is greater than the number of unique visitors, that's impossible. Is that the right direction, right?
Speaker 1 (57:34):
Yeah, it's highly improbable. Impossible? No. If no one's ever heard about you before and you were suddenly on a Super Bowl commercial, then it's possible that it messes you up. Ask me how I know; that's a real story from my life.
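That kind of check is easy to apply after the fact; a rough sketch, with synthetic_visitors standing for the synthetic output and column names assumed from this visitors dataset.

```python
def is_plausible(seq):
    # First-time visits should never exceed unique visits within a sequence.
    return (seq["First.Time.Visits"] <= seq["Unique.Visits"]).all()

good_ids = [
    example_id
    for example_id, seq in synthetic_visitors.groupby("example_id")
    if is_plausible(seq)
]
filtered = synthetic_visitors[synthetic_visitors["example_id"].isin(good_ids)]
```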
Speaker 2 (57:51):
I can see a future workshop that's just Mason's stories.
Speaker 1 (57:56):
People have been asking me that for years, and whenever I used to do shows, people used to come and just goad me on. It's like, I could tell you about what happens when you have a website and then you decide to buy a Super Bowl ad, but you don't decide to preemptively scale.
Speaker 2 (58:11):
Probably lots of downtime.
Speaker 1 (58:13):
Yes, a very yes. It surfaced bugs that we did not know existed because we're not used to that. So what happens if everybody just hits the homepage? Anyway, I'm about to go on a tangent. We're going to stop it. There we are. Good. I think that was a really cool workshop. I'm really happy with how this turned out. So that's how you generate synthetic time series data. A couple of housekeeping things before we go. As always, we have a giveaway, so if you would like some free swag, some stickers and stuff, please either scan this QR code or go to grtl.ai, Gretel without the vowels, slash workshop, that's the short link, and you can just fill out a Google form. Give me your name, your email address, and answer a question, and we'll ship you some swag. As you can see from the ticker that's happening down here at the bottom, which I'm going to leave up for a little bit.
(59:11)
You can go back and watch the VOD if you missed this, because the form will remain open for a week. You'll have a whole week; October 12th is when I'll close that form. So if you're watching this after we go live, you can still get some stickers up until a week after. One thing I do want to point out really quickly, if I can find it: we announced and launched our own community Discord channel, which, if you've been watching the ticker at the bottom of the screen, is giving you the link to it. If you want to join us and just continue chatting with us, or you have questions, or if you run into problems that you need help with, please feel free to drop into the Discord channel and reach out. We're going to start trying to do a little bit more content in the Discord as well, so definitely come and join.
(59:53)
There's going to be a lot of fun things happening there. You're definitely not going to want to miss it. And then finally, we are hosting a conference. Can you see why we went down from two to one a month? We're hosting an all virtual, free conference, Synthesize 2023, which is going to be the developer conference for synthetic data. So we're going to have a lot of really cool speakers, a lot of really cool talks. We're still putting this all together, but go ahead and mark your calendar for February 8th, or you can go to gretel.ai/synthesize 2023. I'll drop this link in the chat here for anyone who wants it. You can register, and you'll get emails about updates with everything that's happening with Synthesize. We did the swag, we did Synthesize, the Discord. Is there anything else you want to add or throw in, Kendrick, before we're done?
Speaker 2 (01:00:44):
No, I don't think there's anything else. Thanks very much for inviting me on and getting to chat about time series stuff.
Speaker 1 (01:00:52):
I always love bringing people on. I learn so much every time. Learning time with Mason is how I pretty much view this. It's always fun to talk with people, especially people who implemented this stuff. You get inside knowledge that you wouldn't get just by reading docs or something, so it's definitely worth it. Thank you everyone. We'll be back next month on November 2nd with another Gretel workshop, and until then, have a good day and I'll see you next time. Bye-bye.