Some say Kaggle is great, some say it’s useless, but no one says it’s boring. Any particular Kaggle competition may not be that stimulating to someone, but Kaggle provides a great assortment - a variety of domains, a variety of metrics, datasets of different sizes and origins. And among this wide variety I find competitions with time-series datasets to be the most exciting.
Maybe because time-series competitions offer so many possibilities for feature engineering. In some competitions you have a hard time thinking of how to generate additional features; with time series you can produce so many of them that feature selection becomes the greatest challenge.
Maybe a lot of the excitement is due to the fact that time-series competitions are unpredictable. The public leaderboard is just a rough indicator of what the final standing will be; on the private leaderboard competitors lose and gain hundreds of places. For example, only three competitors from the public LB top ten survived into the final top ten of the Rossmann Store Sales competition. In the most recent time-series competition, Grupo Bimbo Inventory Demand, the shake-up was less extreme but still significant.
And such shake-ups are very much about cross-validation. With time-series data it is easy to get a validation strategy wrong and difficult to get it right. I took part in both of the above-mentioned competitions, and in this post I would like to share what I learned about cross-validation with time-series data.
But first, a few words about the general approach to time-series competitions.
General approach to time-series competitions
Very often (maybe too often) a Kaggle competition dataset looks like this:
Feature1 | Feature2 | FeatureN | Target |
---|---|---|---|
1 | 20 | 0.045 | 10 |
2 | 15 | 1.100 | 12 |
200 | 12 | 2.100 | 11 |
A flat table of anonymized features and a classification or regression target. Such competitions are the most popular on Kaggle, maybe because one can start training a model straight away - just create a design matrix and feed it to xgboost.
But what if we know that one of the features represents time? Like in the table below:
Week | Feature2 | FeatureN | Target |
---|---|---|---|
1 | 20 | 0.045 | 10 |
2 | 15 | 1.100 | 12 |
3 | 12 | 2.100 | 11 |
4 | 10 | 0.100 | 9 |
In this case time is in weeks, but it could be in years, months, minutes, or seconds. The table looks very similar to the first one, and we could also feed this dataset to xgboost as it is - but we’d rather not.
Why?
One reason is that the rows, the observations, are not i.i.d. (independent and identically distributed). I.i.d. is just an assumption, and for very many Kaggle competitions it does not hold - the observations are not really independent. But usually we do not know how the observations depend on each other. Many competitions have been won by figuring that out, and kagglers call such a dependency a data leak.
But when a time dimension is given explicitly, we know that the observations are connected by the arrow of time. And we can extract a lot of information from this arrow.
"Tomorrow will be the same day as it was yesterday"
- a poet said. He was right - almost. Today is the same day as yesterday, tomorrow will be the same day as today, and still tomorrow will be slightly different from yesterday. Predict that the weather tomorrow will be the same as the weather today and you will be right about 70% of the time. But to predict the weather 3 days ahead you will likely need a more sophisticated approach. Things change with time, but most often not abruptly - they change slowly (sometimes you can even estimate the rate of this slow change).
Let’s illustrate that with a simple table.
Week | StoreID | ProductID | Sales |
---|---|---|---|
1 | 1 | 2 | 10 |
2 | 1 | 2 | 12 |
3 | 1 | 2 | 13 |
4 | 1 | 2 | 13 |
1 | 10 | 2 | 100 |
2 | 10 | 2 | 110 |
3 | 10 | 2 | 130 |
4 | 10 | 2 | 120 |
Sales yesterday are a good predictor of sales today, and a slightly worse but still good predictor of sales tomorrow. How can we utilize that?
There is a complicated way to do that, and it’s called time-series analysis.
But we are doing a machine learning competition, and Kaggle competitions are won by xgboost - not by ARIMA models.
We would like to do something simple, and simple it is - we will just convert the weekly sales (the target) into features.
A naive way to do that would be:
Week | StoreID | ProductID | Sales | Sales_Week_1 | Sales_Week_2 | Sales_Week_3 | Sales_Week_4 |
---|---|---|---|---|---|---|---|
1 | 1 | 2 | 10 | 10 | 12 | 13 | 13 |
2 | 1 | 2 | 12 | 10 | 12 | 13 | 13 |
3 | 1 | 2 | 13 | 10 | 12 | 13 | 13 |
4 | 1 | 2 | 13 | 10 | 12 | 13 | 13 |
1 | 10 | 2 | 100 | 100 | 110 | 130 | 120 |
2 | 10 | 2 | 110 | 100 | 110 | 130 | 120 |
3 | 10 | 2 | 130 | 100 | 110 | 130 | 120 |
4 | 10 | 2 | 120 | 100 | 110 | 130 | 120 |
Now we have highly predictive features - too predictive. The feature Sales_Week_N is a perfect but useless predictor for week N. And even if we somehow ensure that during training Sales_Week_1 is not used for week 1, we would still be using week 4 to predict week 3 - predicting the past from the future. But even this is not the biggest problem.
The real problem is that the predictive power of a feature based on the sales of a particular week depends not on the absolute number of that week but on the distance in time between that week and the week we are trying to predict.
So instead of creating features tied to absolute week numbers, we will create lagged features - features based on relative distance in time.
Week | StoreID | ProductID | Sales | Sales_lag_1 |
---|---|---|---|---|
1 | 1 | 2 | 10 | NA |
2 | 1 | 2 | 12 | 10 |
3 | 1 | 2 | 13 | 12 |
4 | 1 | 2 | 13 | 13 |
1 | 10 | 2 | 100 | NA |
2 | 10 | 2 | 110 | 100 |
3 | 10 | 2 | 130 | 110 |
4 | 10 | 2 | 120 | 130 |
We added the feature Sales_lag_1, which for every row is the sales of the week before. For each observation, lagged features are relative to its week.
In the example above we have a lag of one week, but we can continue adding sales from two weeks before, three weeks before, and so on until we run out of history (usually we stop well before that).
Week | StoreID | ProductID | Sales | Sales_lag_1 | Sales_lag_2 | Sales_lag_3 |
---|---|---|---|---|---|---|
1 | 1 | 2 | 10 | NA | NA | NA |
2 | 1 | 2 | 12 | 10 | NA | NA |
3 | 1 | 2 | 13 | 12 | 10 | NA |
4 | 1 | 2 | 13 | 13 | 12 | 10 |
1 | 10 | 2 | 100 | NA | NA | NA |
2 | 10 | 2 | 110 | 100 | NA | NA |
3 | 10 | 2 | 130 | 110 | 100 | NA |
4 | 10 | 2 | 120 | 130 | 110 | 100 |
But we see that the observations have more and more NAs the further back the lag reaches. This is because for lag_1 we have the previous week’s sales for all weeks except the first one, for lag_2 we have no data for the first and the second week, and so on.
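As a minimal sketch, here is how such lagged features might be built with pandas. It assumes each (StoreID, ProductID) pair has exactly one row per consecutive week; with gaps in the weeks you would merge on Week minus the lag instead of shifting.

```python
import pandas as pd

# Toy data matching the tables above
df = pd.DataFrame({
    "Week":      [1, 2, 3, 4, 1, 2, 3, 4],
    "StoreID":   [1, 1, 1, 1, 10, 10, 10, 10],
    "ProductID": [2] * 8,
    "Sales":     [10, 12, 13, 13, 100, 110, 130, 120],
})

df = df.sort_values(["StoreID", "ProductID", "Week"])
for lag in (1, 2, 3):
    # Shift Sales within each (store, product) series;
    # the NAs at the start of each series appear automatically.
    df[f"Sales_lag_{lag}"] = (
        df.groupby(["StoreID", "ProductID"])["Sales"].shift(lag)
    )
```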
Okay, we used the arrow of time to create some features - are we ready to run xgboost? Not yet - first we should decide which observations to include in the training set.
And to do that, let’s first look at the test set. Suppose we want to predict week 5. The test set will look like:
Week | StoreID | ProductID | Sales |
---|---|---|---|
5 | 1 | 2 | ? |
5 | 10 | 2 | ? |
and after adding lagged features
Week | StoreID | ProductID | Sales_lag_1 | Sales_lag_2 | Sales_lag_3 | Sales_lag_4 | Sales |
---|---|---|---|---|---|---|---|
5 | 1 | 2 | 13 | 13 | 12 | 10 | ? |
5 | 10 | 2 | 120 | 130 | 110 | 100 | ? |
Let’s look one more time at the training set with all the features. This time let’s arrange it by week.
Week | StoreID | ProductID | Sales_lag_1 | Sales_lag_2 | Sales_lag_3 | Sales_lag_4 | Sales |
---|---|---|---|---|---|---|---|
1 | 1 | 2 | NA | NA | NA | NA | 10 |
1 | 10 | 2 | NA | NA | NA | NA | 100 |
2 | 1 | 2 | 10 | NA | NA | NA | 12 |
2 | 10 | 2 | 100 | NA | NA | NA | 110 |
3 | 1 | 2 | 12 | 10 | NA | NA | 13 |
3 | 10 | 2 | 110 | 100 | NA | NA | 130 |
4 | 1 | 2 | 13 | 12 | 10 | NA | 13 |
4 | 10 | 2 | 130 | 110 | 100 | NA | 120 |
We can train a model on the whole set. But should we? All the lagged features for the first week are NAs, obviously - there is no history for the first week. There is not much sense in including the first week in the training set. What about week 2 - fewer NAs than for week 1, but still many. Maybe we should not include it either. What about week 3? How do we decide which weeks to include? The answer is the usual one - it depends. We want more features with actual data rather than NAs, but we also want more observations in the training set. It’s a compromise, and it depends on the data and on the computational resources we have. Later we will discuss these trade-offs using the Bimbo competition as an example.
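Acting on this trade-off can be as simple as dropping the earliest weeks; a tiny sketch, with a hypothetical cut-off value that would need to be validated on real data:

```python
# Hypothetical cut-off: drop the weeks whose lagged features are mostly NA.
# The right value is data-dependent - a knob to tune, not a fixed rule.
min_week = 3
train = df[df["Week"] >= min_week]
```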
But for now I’d like you to notice that in the training set Sales_lag_4 is always NA. We have this feature for the test set, but not for the training set. We have more information for the test set, but we cannot use it.
What if we were asked to predict week 6 from the same data?
Week | StoreID | ProductID | Sales_lag_1 | Sales_lag_2 | Sales_lag_3 | Sales |
---|---|---|---|---|---|---|
6 | 1 | 2 | NA | 13 | 13 | ? |
6 | 10 | 2 | NA | 120 | 130 | ? |
Week six is the second week of the test set, and we do not have Sales for the week before it. So the available features have a lag of 2 or more, which means that for training we can also use only features with a lag of 2 or more.
Week | StoreID | ProductID | Sales_lag_2 | Sales_lag_3 | Sales_lag_4 | Sales |
---|---|---|---|---|---|---|
1 | 1 | 2 | NA | NA | NA | 10 |
1 | 10 | 2 | NA | NA | NA | 100 |
2 | 1 | 2 | NA | NA | NA | 12 |
2 | 10 | 2 | NA | NA | NA | 110 |
3 | 1 | 2 | 10 | NA | NA | 13 |
3 | 10 | 2 | 100 | NA | NA | 130 |
4 | 1 | 2 | 12 | 10 | NA | 13 |
4 | 10 | 2 | 110 | 100 | NA | 120 |
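The general rule: when predicting h weeks past the last observed week, only lags of h or more can be computed for the test rows, so the training set should be built with the same restricted set of lags. A sketch, reusing the toy data from above and the same contiguous-weeks assumption:

```python
def add_lags(df, horizon, max_lag):
    """Add only the lags that will also be computable for test rows
    `horizon` weeks past the last observed week."""
    df = df.sort_values(["StoreID", "ProductID", "Week"]).copy()
    for lag in range(horizon, max_lag + 1):
        df[f"Sales_lag_{lag}"] = (
            df.groupby(["StoreID", "ProductID"])["Sales"].shift(lag)
        )
    return df

train_week5 = add_lags(df, horizon=1, max_lag=4)  # lags 1..4 available
train_week6 = add_lags(df, horizon=2, max_lag=4)  # only lags 2..4 remain
```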
But maybe we can create more features? Yes, we can, and we should - we should create aggregation features.
Aggregation features
So far we have added lagged Sales as features. The example was simple, and we had only 4 weeks of training data. But what if we have 200 weeks (about 4 years) of data? Do we want to add 200 features - sales with lags 1 to 200? Most likely not. The most recent week is a good predictor, the week before it could be a good predictor as well, but the further in the past a week lies, the less predictive power it has. Still, some information is there. How do we use it? It’s time to aggregate, to summarize. Instead of including all possible lagged weekly sales as separate features, we will aggregate them. For example, we could include the mean sales of all weeks before the current one, like this:
Week | StoreID | ProductID | Sales | mean_Sales_lag_1_ProdStore |
---|---|---|---|---|
1 | 1 | 2 | 10 | NA |
2 | 1 | 2 | 12 | 10.00000 |
3 | 1 | 2 | 13 | 11.00000 |
4 | 1 | 2 | 13 | 11.66667 |
1 | 10 | 2 | 100 | NA |
2 | 10 | 2 | 110 | 100.00000 |
3 | 10 | 2 | 130 | 105.00000 |
4 | 10 | 2 | 120 | 113.33333 |
In that case we grouped by the tuple (StoreID, ProductID). But the mean sales of a store as a whole could have predictive power, and the mean sales of a product across all stores could be predictive too. So we could group just by StoreID and add a feature like the mean sales of a store for all weeks before the current one. We can do the same for products.
Week | StoreID | ProductID | Sales | mean_Sales_lag_1_ProdStore | mean_Sales_lag_1_Prod | mean_Sales_lag_1_Store |
---|---|---|---|---|---|---|
1 | 1 | 2 | 10 | NA | NA | NA |
2 | 1 | 2 | 12 | 10.00000 | 55.0 | 10.00000 |
3 | 1 | 2 | 13 | 11.00000 | 58.0 | 11.00000 |
4 | 1 | 2 | 13 | 11.66667 | 62.5 | 11.66667 |
1 | 10 | 2 | 100 | NA | NA | NA |
2 | 10 | 2 | 110 | 100.00000 | 55.0 | 100.00000 |
3 | 10 | 2 | 130 | 105.00000 | 58.0 | 105.00000 |
4 | 10 | 2 | 120 | 113.33333 | 62.5 | 113.33333 |
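One way to compute such expanding means in pandas, as a sketch rather than the definitive pipeline: aggregate to one row per group and week first, so that a store’s same-week rows do not leak into another store’s feature, then compute the mean over strictly earlier weeks and merge it back.

```python
import numpy as np

def mean_sales_before(df, keys):
    """Mean of Sales over all weeks strictly before the current one,
    computed within the groups defined by `keys`."""
    weekly = (df.groupby(keys + ["Week"])["Sales"]
                .agg(["sum", "count"])
                .reset_index())
    g = weekly.groupby(keys)
    # Totals up to and including each week, minus the current week,
    # give totals over strictly earlier weeks.
    prev_sum = g["sum"].cumsum() - weekly["sum"]
    prev_cnt = g["count"].cumsum() - weekly["count"]
    weekly["prev_mean"] = prev_sum / prev_cnt.where(prev_cnt > 0)  # NA if no history
    merged = df.merge(weekly[keys + ["Week", "prev_mean"]],
                      on=keys + ["Week"], how="left")
    return merged["prev_mean"].to_numpy()

df["mean_Sales_lag_1_ProdStore"] = mean_sales_before(df, ["StoreID", "ProductID"])
df["mean_Sales_lag_1_Store"]     = mean_sales_before(df, ["StoreID"])
df["mean_Sales_lag_1_Prod"]      = mean_sales_before(df, ["ProductID"])
```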
Again, the example is very simple - just 4 weeks. What if we have 150 or 500 weeks of data - should we summarize over all of them? Maybe two years ago something changed dramatically and now the distribution is very different. Or maybe we have some seasonality in the sales. So the question is: how far into the past do we want to look when creating the aggregation features?
There is no universal recipe. But it’s important to realize that the size of the look-back window becomes one of the parameters of the system. And when we have a lot of data we could compute features over several different windows: a short one for the most recent tendency - for example 6 weeks; 26 weeks (half a year) as another window; and all the historical data could be used as well. We could use decaying weights when computing the averages - and all these options become hyper-parameters of our model. We could also compute not just means but medians, quartiles, and higher-order moments. And if we have features like ProductCategory or StoreType, we could aggregate over them as well, and over their combinations with other categorical features.
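A sketch of the windowed and decaying variants, again assuming contiguous weekly rows per series; the window lengths and the ewm span here are just illustrative hyper-parameter values, not recommendations:

```python
df = df.sort_values(["StoreID", "ProductID", "Week"])
group = df.groupby(["StoreID", "ProductID"])["Sales"]

for window in (6, 26):
    # Mean over the `window` weeks before the current one;
    # min_periods=1 gives partial means for early weeks instead of NA.
    df[f"mean_Sales_last_{window}w"] = group.transform(
        lambda s, w=window: s.shift(1).rolling(w, min_periods=1).mean()
    )

# Exponentially decaying weights; the span is another hyper-parameter.
df["ewm_Sales_span6"] = group.transform(lambda s: s.shift(1).ewm(span=6).mean())
```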
The number of engineered features can explode rapidly - very rapidly indeed - and in this search for more and more information we can start overfitting very quickly.
We need a solid cross-validation strategy to control overfitting. This post is about cross-validation, but the introduction to time-series feature engineering has taken too much space already - so I will split the post into two parts.
In the second part I’ll look at cross-validation using the example of the Grupo Bimbo Inventory Demand competition.