Sunday, October 30, 2016

Cross-validation with time-series datasets - part two - Bimbo

Recently Kaggle launched two new competitions with time-series data - Santander Product Recommendation and Outbrain Click Prediction. And as with so many time-series competitions, the choice of cross-validation strategy will be a decisive factor. So it is a good time to look at cross-validation for time-series data using the example of a past competition. In the first part of this post we looked at the general approach to time-series competitions and at how to create features using time dependencies. The conclusion was that one can create a lot of features - so many that it becomes really easy to overfit - which makes a solid cross-validation strategy extremely important. In this part I will use a recent Kaggle competition, Grupo Bimbo Inventory Demand, to look at possible approaches to cross-validation with time-series datasets.

Bimbo Group supplies over 1 million stores in Mexico with bakery and related products. In the competition Bimbo Group asked participants to predict how much product they should supply to each of their clients (stores). And they gave a lot of data - the train set contained 75 million rows. We are given 7 weeks of Bimbo's sales to their clients. We are also given other data (such as returns of unsold products). The full description of the competition and its datasets is on the competition web page. But I’d like to concentrate on cross-validation and omit less relevant details. So for the purpose of this blog I’ve made the following simplifications to the data:

  • the columns were renamed to English equivalents
  • the target is Sales in pesos
    • features like sales in units and returns are ignored
  • the number of categorical features was reduced
  • I do not distinguish between a client and a store
  • the week numbering starts from 1

So the head of the modified train set looks like this:

Week Depo Product Client Sales
1 1110 1453543 1242 45.84
1 1110 9503385 5380 14.82
1 1110 1159580 30571 68.75
1 1110 198780 42434 45.8
1 1110 9503385 37057 15
1 1110 18641 31719 30.36
1 1110 1457784 30571 43.75
1 1110 817243 1146 21.39
1 1110 181232 1309 13.52

These are just the first rows of the train set; in this competition we were given a data set of 75 million rows. On the one hand that looks like a lot, but on the other hand - considering that there are almost 900,000 clients and 2,600 products - it is not really that much. For example, when it comes to (Product, Client) tuples the data are sparse, and indeed there are more than a million such tuples which appear in the test set but are not present in the train set at all. Both the size of the data and this sparsity are important when choosing a cross-validation strategy.
But the most important factor when deciding on a cross-validation approach for any competition is how the test set is split into the public and private subsets.

For those who have never taken part in a Kaggle competition - a few words about what the public/private split means. The final standing of a Kaggle competition is decided by how well the participants' models perform on the test set - for which, obviously, the competitors do not know the target variable. During a competition the preliminary ranking of the participants can be seen on the public leaderboard. This leaderboard shows how submitted solutions scored on a portion of the test set, called the public test set. But the final results of the competition are decided by the score on the private portion of the full test set. And the difference between the preliminary standing on the public leaderboard and the final standing on the private leaderboard might be significant, very significant - participants can lose hundreds of places. For example, only three competitors who were in the top ten on the public LB made it to the final top ten of the Rossmann Store Sales competition, which also was a time-series one.

The usual problem is that participants overfit to the public test set.
Knowing how the test set was split helps to avoid this kind of overfitting. In many competitions the organizers split the test set randomly - say, a 30% public portion and a 70% private portion. But for time-series competitions a more natural way of splitting the test set is by time.

And the test set of the Bimbo competition was split by time.
In this competition we were given seven weeks of data to train a model and were asked to predict the sales for the following two weeks.
The competition organizers clearly stated that the test set was split roughly 50%/50% between private and public and that the split was by time. It means that one of the test set weeks is used for the public leaderboard and the other for the private leaderboard. The organizers did not say which week is private and which is public, but one can spend just two submissions to easily figure that out. In the case of Bimbo the first week of the test set was public and the second week was private.
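As an aside, such a probe can be as simple as two submissions that differ only in which test week gets real predictions - whichever one scores sensibly on the public LB reveals the public week. Here is a hypothetical sketch of it; the file names, the my_baseline_predictions helper, and the id/Week/Sales column names are all assumptions chosen to match the renamed data above, not the actual competition files.

```python
import pandas as pd

test = pd.read_csv("test.csv")            # assumed columns: id, Week, ...
base = my_baseline_predictions(test)      # hypothetical: any reasonable predictions, as a pandas Series

# Submission A: real predictions for week 8, an obviously bad constant for week 9.
sub_a = pd.DataFrame({"id": test["id"], "Sales": base.where(test["Week"] == 8, 0)})
# Submission B: the reverse.
sub_b = pd.DataFrame({"id": test["id"], "Sales": base.where(test["Week"] == 9, 0)})

sub_a.to_csv("probe_week8.csv", index=False)
sub_b.to_csv("probe_week9.csv", index=False)
```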

We want our cross-validation score to tell us how our model will perform on the private leaderboard - and it means that cross-validation arrangements should reflect how the final evaluation of the competition is done. But with time-series data not only cross-validation but the entire approach to the competition depends on how the test set is split.

Why?

Because the final result depends only on how well the model performs on the second week of the test set; for the final standing it does not matter how well you predicted the first week. It is probably easier to predict the first week of the test set, as it is closer to the train set, but the final score is decided by the second week. And as I illustrated in the first part, we have fewer features for the second week of the test set, because we simply do not have Sales for the previous week - we can use only Sales with lag 2 or more to engineer features. It might mean that we should have two different models - one for the first week of the test set and another for the second week. But when we know that only the second week matters, we actually do not need to build a model for the first week. We should also understand that some participants of the competition will do that, and they will look good on the public leaderboard.

So we have made a decision to build a model to predict only the second week of the test set. It means that for feature engineering we can only use Sales with lag two or more; we cannot use the immediately preceding week, because we do not have sales for the week immediately before the second week of the test set.

And we should make sure that our validation set is not contaminated with data which the second week of the test set cannot have. Such contamination can be subtle. For example, some common approaches to encoding categorical variables can leak information into the validation set - and the CV score will be overoptimistic.
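To make that concrete, here is a minimal sketch of one such trap - mean-target encoding of a categorical column - and how to keep it leak-free. It assumes the simplified train set above is loaded into a pandas DataFrame called train and that (as we end up doing below) week 7 is held out for validation with features restricted to lag 2 or more; the exact encoding scheme is just an illustration.

```python
import pandas as pd

VALID_WEEK = 7                                    # the held-out week mimicking the second test week
history = train[train["Week"] <= VALID_WEEK - 2]  # weeks 1..5 only (lag >= 2)

# Leaky: statistics computed on the full train set include the validation
# week's own target, so the CV score for week 7 becomes overoptimistic.
# product_mean = train.groupby("Product")["Sales"].mean()

# Leak-free: statistics come only from weeks the validation week is allowed to see.
product_mean = (history.groupby("Product")["Sales"]
                       .mean().rename("Product_mean_Sales"))

valid = train[train["Week"] == VALID_WEEK].join(product_mean, on="Product")
```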

Ideally we would like our cross-validation arrangements to emulate the private/public test split as closely as possible. In the case of Bimbo it means that we should hold out the last two weeks of the train set as the validation set, train the model on the remaining data, and when predicting the holdout set we would have an honest, but slightly pessimistic, estimate of both the public and the private score. The first week of the validation set is an analogue of the public portion of the test set, and the second week corresponds to the private part of the test set. It is even possible to estimate the difference between the performance of the model on the public and private leaderboards.

But we can rarely do the ideal thing. There is a question - how pessimistic is that "slightly"? With this ideal approach it could be too much.
We have 7 weeks of history to train our models. If we hold out the last two weeks as a validation set, just 5 weeks of data are left for training. Not that much. Moreover, as we are interested in predicting the second week of the test set, we will use engineered features with lag 2 or more. But the first two weeks of the train set do not have such features, so we have reduced our train set significantly. And when it comes to aggregation features - they are not reliable when computed on just 5 weeks or even less.

And in this particular competition it was even more significant: when we look at a (Product, Client) tuple we see that in the seven weeks of the train data a particular product was often delivered to a store for only two, three, or even just one of the seven weeks. So when we withhold two weeks of the train data we lose a lot of information.

As a result the CV score could become not just pessimistic but unreliable, because the model would be too weak and very different from the one trained on the whole train set.

But do we really need to hold out two weeks for the validation set? We don't. As we are really interested in predicting the second week of the test set, we can hold out just one week for validation - it will mimic the second week of the test set. We keep week 7 (the last week of the train set) as the validation set, and when we compute features for this week we do not use Sales of the previous week (week 6), because the second week of the test set does not have this information.

So far we have made three important decisions:

  • we build a model to predict only the second week of the test set
  • we hold the last week of the training set as the validation set which mimics the second week of the test set
  • we engineer features using lag 2 or more

One more decision to be made. Which weeks of the train set will we use to train models?

First, look at the table below

Week Product Client mean_Sales mean_Sales_Client mean_Sales_Product Sales_lag_2
3 357 43507 NA 22.41 26.52 NA
3 30415 43507 NA 22.41 25.73 NA
3 35144 43507 20.5 22.41 56.04 20.5
3 30532 43507 38.9 22.41 48.22 38.9
3 36748 43507 NA 22.41 30.31 NA
3 43067 43507 NA 22.41 19.13 NA
3 43069 43507 7.41 22.41 34.46 7.41
3 43285 43507 NA 22.41 70.25 NA
4 357 43507 29.16 24.7 27.04 29.16
4 30415 43507 7.5 24.7 25.96 7.5
4 35141 43507 41 24.7 48.96 NA
4 35144 43507 20.5 24.7 55.77 20.5
4 30532 43507 42.79 24.7 50.29 46.68
4 36748 43507 NA 24.7 30 NA
4 43067 43507 NA 24.7 18.75 NA
4 43069 43507 14.82 24.7 34.62 22.23
4 43285 43507 63.36 24.7 68.89 63.36
5 357 43507 29.16 27.57 27.28 29.16
5 30415 43507 11.25 27.57 25.67 15
5 35141 43507 41 27.57 49.05 NA
5 35144 43507 20.5 27.57 55.01 20.5
5 30532 43507 51.87 27.57 49.96 70.02
5 36748 43507 11.53 27.57 29.79 11.53
5 43067 43507 19.26 27.57 18.63 19.26
5 43069 43507 12.35 27.57 33.81 7.41
5 43285 43507 84.48 27.57 68.73 105.6
6 357 43507 29.16 26.33 27.12 29.16
6 35141 43507 30.75 26.33 49 20.5
6 35144 43507 20.5 26.33 54.2 20.5
6 30532 43507 46.68 26.33 49.85 31.12
6 43285 43507 73.92 26.33 68.63 52.8

The table illustrates some points about the choices discussed above. I do not include weeks 1 and 2, as all the features computed with a lag of two weeks are empty for them. We see that for week 3 a lot of features are empty as well. And all the aggregation features for week 3 were computed from just one week - so they are not really aggregations. In this particular competition, and usually with time-series data, central tendency features like mean or median are the most powerful. But they need enough data points to become reliable. It means that weeks 3 and 4 simply do not have robust features.

So we are left with weeks 5, 6, 7 for the final models, and with weeks 5, 6 for the cross-validation models.
We can train a model on weeks 5, 6, 7, or on 6, 7, or just on 7. We can do all those combinations and then use the predictions at the ensemble level.
Still, we cannot reliably compare the performance of a model trained on weeks 5, 6, 7 with a model trained on weeks 6, 7 using cross-validation alone.

But in this case feedback from the public leaderboard can be very helpful.
I would like to elaborate on that.
I already mentioned that the public LB is based on the first week of the test set - and this week is easier to predict than the second, as it is closer to the train set and we have more information to train a model. But there is not much sense in looking good on the public LB only to fail miserably on the private LB.
What we can do instead is train a separate model for the first week of the test set - one which does not use the last week of the train set at all. Yes, we throw away a whole week of the train set, but we treat the first week of the test set as if it were the second. The score on the public leaderboard then becomes a much better estimate of the private score, and we can also evaluate how well our CV procedure works and adjust it if we see problems.

Summing everything up, we have the following cross-validation procedure for the Bimbo competition (a rough code sketch follows the list):

  • We keep the last week of the train set (week 7) as the validation set.
  • We compute features for the validation set and for the training set with lag 2 or more (the same condition as for the private part of the test set).
  • We train the model for validation on week 6 (with features based on weeks 1-4). We make predictions and compute the score for week 7. We use the same model to predict the first week of the test set (week 8).
  • For the second week of the test set (week 9) we build a new model, this time using all the data of the train set, but still with all lagged features computed with lag 2 or more. We train this model on the last week of the train set.
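Here is a rough sketch of that procedure in Python with pandas and xgboost, under the simplified column names used in this post and assuming the train and test sets are already loaded into DataFrames called train and test. The build_features helper and the placeholder metric are my own illustrative stand-ins (with a single (Client, Product) mean standing in for the full feature set), not code from the competition.

```python
import numpy as np
import pandas as pd
import xgboost as xgb

def score(y_true, y_pred):
    # placeholder metric; use whatever the competition actually scores on
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def build_features(history, rows):
    """For each row, attach features computed only from weeks at least
    two weeks older than the row's week (the lag >= 2 condition)."""
    parts = []
    for week, part in rows.groupby("Week"):
        past = history[history["Week"] <= week - 2]
        agg = (past.groupby(["Client", "Product"])["Sales"]
                   .mean().rename("mean_Sales_ClientProduct").reset_index())
        parts.append(part.merge(agg, on=["Client", "Product"], how="left"))
    return pd.concat(parts, ignore_index=True)

features = ["mean_Sales_ClientProduct"]   # plus whatever else gets engineered

# 1. Validation model: trained on week 6 (features from weeks 1-4),
#    scored on the holdout week 7 (features from weeks 1-5).
train_cv = build_features(train, train[train["Week"] == 6])
valid    = build_features(train, train[train["Week"] == 7])
model_cv = xgb.XGBRegressor().fit(train_cv[features], train_cv["Sales"])
print("CV score:", score(valid["Sales"], model_cv.predict(valid[features])))

# 2. The same model predicts week 8, the public part of the test set
#    (its features come from weeks 1-6, so week 7 is never used here).
pred_week8 = model_cv.predict(build_features(train, test[test["Week"] == 8])[features])

# 3. Final model for week 9, the private part: trained on week 7
#    (features from weeks 1-5) and predicting with features from weeks 1-7.
train_full = build_features(train, train[train["Week"] == 7])
model_full = xgb.XGBRegressor().fit(train_full[features], train_full["Sales"])
pred_week9 = model_full.predict(build_features(train, test[test["Week"] == 9])[features])
```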

With this approach we have reliable cross-validation, plus the public LB score becomes a good proxy for the private LB score.

This approach is conservative, and the CV score will be a little pessimistic - and there might be reasons to be more aggressive. But it is not the absolute cross-validation score of a model that matters; what matters is that the CV score moves in the same direction as the test score - and with this approach, using the public LB as an additional validation set, one is much less likely to overfit.

It has been a long post, but having participated in several competitions with time-series data I have learnt that it is much more effective to invest time and run some experiments to get a solid cross-validation strategy at the very beginning of any time-series competition. Feature engineering, hyper-parameter tuning and stacking will work only if you can rely on your cross-validation score.

P.S.
The cross-validation approach described in this post works well, but mostly for one-level models. For any ensemble which uses something more complex than a weighted average, cross-validation becomes much more involved. And cross-validation strategy for ensembles is a huge topic in itself.

Saturday, October 15, 2016

Cross-validation with time-series datasets - Part One

Some say Kaggle is great, some say it’s useless, but no one says it’s boring. Any particular Kaggle competition might not be that stimulating to someone, but Kaggle provides a great assortment - a variety of domains, a variety of metrics, datasets of different sizes and different origins. And among this wide variety I find competitions with time-series datasets to be the most exciting.
Maybe because time-series competitions provide a lot of possibilities for feature engineering. For some competitions you have a hard time thinking of how to generate additional features; with time series you can produce so many of them that feature selection becomes the greatest challenge.
Maybe a lot of the excitement is due to the fact that time-series competitions are unpredictable. The public leaderboard is just a rough indicator of what the final standing will be. On the private leaderboard competitors lose and gain hundreds of places. For example, only three competitors from the public LB top ten survived into the final top ten of the Rossmann Store Sales competition. For the most recent time-series competition, Grupo Bimbo Inventory Demand, the shake-up was less extreme but still significant.

And such shake-ups are very much about cross-validation. With time-series data it is easy to get the validation strategy wrong and difficult to get it right. I took part in both of the competitions mentioned above, and in this post I would like to share what I learned about cross-validation with time-series data.

But first a few words about the general approach to time-series competitions.

General approach to time-series competitions

Very often (maybe too often) a Kaggle competition dataset will look like:

Feature1 Feature2 FeatureN Target
1 20 0.045 10
2 15 1.100 12
200 12 2.100 11

A flat table of anonymized features and a classification or regression target. Such competitions are the most popular on Kaggle, maybe because one can start training a model straight away - just create a design matrix and feed it to xgboost.
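For example, assuming such a flat table sits in a CSV with a column named Target (the file and column names here are hypothetical), a baseline really can be this short:

```python
import pandas as pd
import xgboost as xgb

train = pd.read_csv("train.csv")              # hypothetical flat train table
X, y = train.drop(columns=["Target"]), train["Target"]
model = xgb.XGBRegressor().fit(X, y)          # or XGBClassifier for a classification target
predictions = model.predict(pd.read_csv("test.csv"))  # test assumed to have the same feature columns
```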

But what if we know that one of the features represents time? Like in the table below

Week Feature2 FeatureN Target
1 20 0.045 10
2 15 1.100 12
3 12 2.100 11
4 10 0.100 9

In this case time is in weeks, but it could be in years, months, minutes, seconds. The table looks very similar to the first one. And we could also feed this dataset to xgboost as it is - but we’d rather not.

Why?

One reason is that the rows, the observations, are not i.i.d. (independent and identically distributed). I.i.d. is just an assumption, and for very many Kaggle competitions it does not hold - the observations are not really independent. But usually we do not know how the observations depend on each other. Many competitions have been won by figuring that out, and Kagglers call such a discovery a data leak.

But when a time dimension is explicitly given, we know that the observations are connected by the arrow of time. And we can extract a lot of information from this arrow.

"Tomorrow will be the same day as it was yesterday" - a poet said. He was right - almost. Today is the same day as yesterday, tomorrow will be the same day as today, and still tomorrow will be slightly different than yesterday. Predict that the weather tomorrow will be the same as the weather today and you will have 70% of accuracy. But to predict weather 3 days ahead you likely will need a more sophisticated approach. Things change with time but most often not abruptly, often slow (sometimes you can even figure out the rate of this slow)

Let’s illustrate that with a simple table.

Week StoreID ProductID Sales
1 1 2 10
2 1 2 12
3 1 2 13
4 1 2 13
1 10 2 100
2 10 2 110
3 10 2 130
4 10 2 120

Sales yesterday are a good predictor for sales today, and a slightly worse but still good predictor for sales tomorrow. How can we utilize that?
There is a complicated way to do it, and it’s called time-series analysis.
But we are doing a machine learning competition, and Kaggle competitions are won by xgboost - not by ARIMA models.

We would like to do something simple, and simple it is - we will just convert the weekly sales (the target) into features.
A naive way to do that would be:

Week StoreID ProductID Sales Sales_Week_1 Sales_Week_2 Sales_Week_3 Sales_Week_4
1 1 2 10 10 12 13 13
2 1 2 12 10 12 13 13
3 1 2 13 10 12 13 13
4 1 2 13 10 12 13 13
1 10 2 100 100 110 130 120
2 10 2 110 100 110 130 120
3 10 2 130 100 110 130 120
4 10 2 120 100 110 130 120

Now we have highly predictive features - too predictive. The feature Sales_Week_N is a perfect but useless predictor for week N. And even if we somehow ensure that during training Sales_Week_1 is not used for week 1, we would still be using week 4 to predict week 3 - predicting the past from the future. But even this is not the biggest problem.
The real problem is that the predictive power of a feature based on the sales of a particular week depends not on the absolute number of that week but on the distance in time between that week and the week we are trying to predict.

So instead of creating features with absolute week numbers, we will create lagged features - features based on relative distance in time.

Week StoreID ProductID Sales Sales_lag_1
1 1 2 10 NA
2 1 2 12 10
3 1 2 13 12
4 1 2 13 13
1 10 2 100 NA
2 10 2 110 100
3 10 2 130 110
4 10 2 120 130

We added the feature Sales_lag_1, which for every row is the sales of the week before. For an observation, lagged features are relative to its week.
In the example above we have a lag of one week, but we can continue adding sales two weeks before, three weeks before, and so on until we run out of data (usually we stop well before that):

Week StoreID ProductID Sales Sales_lag_1 Sales_lag_2 Sales_lag_3
1 1 2 10 NA NA NA
2 1 2 12 10 NA NA
3 1 2 13 12 10 NA
4 1 2 13 13 12 10
1 10 2 100 NA NA NA
2 10 2 110 100 NA NA
3 10 2 130 110 100 NA
4 10 2 120 130 110 100

But we see that we get more and more NAs: the earlier the observation and the larger the lag, the more lagged values are missing. For lag_1 we have the previous week's sales for all weeks except the first; for lag_2 we have no data for the first and second weeks, and so on.
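For concreteness, here is one way such lagged features can be built with pandas on the toy table above. This is just a sketch; a self-merge is used instead of a plain groupby-shift so that it also behaves correctly when some weeks are missing for a store/product pair.

```python
import pandas as pd

df = pd.DataFrame({
    "Week":      [1, 2, 3, 4, 1, 2, 3, 4],
    "StoreID":   [1, 1, 1, 1, 10, 10, 10, 10],
    "ProductID": [2, 2, 2, 2, 2, 2, 2, 2],
    "Sales":     [10, 12, 13, 13, 100, 110, 130, 120],
})

# Sales of week w becomes the feature Sales_lag_k of week w + k
# for the same (StoreID, ProductID); unmatched weeks stay NA.
for lag in (1, 2, 3):
    shifted = df[["Week", "StoreID", "ProductID", "Sales"]].copy()
    shifted["Week"] += lag
    shifted = shifted.rename(columns={"Sales": f"Sales_lag_{lag}"})
    df = df.merge(shifted, on=["Week", "StoreID", "ProductID"], how="left")

print(df.sort_values(["StoreID", "Week"]))
```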

Okay, we used the arrow of time to create some features - are we ready to run xgboost? Not yet - first we should decide which observations to include in the training set.
To do that, let’s first look at the test set. Suppose we want to predict week 5. The test set will look like

Week StoreID ProductID Sales
5 1 2 ?
5 10 2 ?

and after adding lagged features

Week StoreID ProductID Sales_lag_1 Sales_lag_2 Sales_lag_3 Sales_lag_4 Sales
5 1 2 13 13 12 10 ?
5 10 2 120 130 110 100 ?

Let’s look one more time at the training set with all the features, this time arranged by week.

Week StoreID ProductID Sales_lag_1 Sales_lag_2 Sales_lag_3 Sales_lag_4 Sales
1 1 2 NA NA NA NA 10
1 10 2 NA NA NA NA 100
2 1 2 10 NA NA NA 12
2 10 2 100 NA NA NA 110
3 1 2 12 10 NA NA 13
3 10 2 110 100 NA NA 130
4 1 2 13 12 10 NA 13
4 10 2 130 110 100 NA 120

We can train a model on the whole set. But should we? All the lagged features for the first week are NAs, obviously - there is no history for the first week. There is not much sense in including the first week in the training set. What about week 2 - fewer NAs than for week 1, but still many. Maybe we should not include it either. What about week 3? How do we decide which weeks to include? The answer is the usual one - it depends. Yes, we want more features with actual values rather than NAs, but we also want more observations in the train set. It is a compromise, and it depends on the data and on the computational resources we have. Later we will discuss these trade-offs using the Bimbo competition as an example.

But now I’d like you to notice that in the training set Sales_lag_4 is always NA. We have this feature for the test set, but not for the train set. We have more information for the test set, but we cannot use it.

What if we were asked to predict week 6 using the same data?

Week StoreID ProductID Sales_lag_1 Sales_lag_2 Sales_lag_3 Sales
6 1 2 NA 13 13 ?
6 10 2 NA 120 130 ?

Week 6 is the second week of the test set, and we do not have Sales for the week before it. So the available features have a lag of 2 or more, which means that for training we can also use only features with lag 2 or more.

Week StoreID ProductID Sales_lag_2 Sales_lag_3 Sales_lag_4 Sales
1 1 2 NA NA NA 10
1 10 2 NA NA NA 100
2 1 2 NA NA NA 12
2 10 2 NA NA NA 110
3 1 2 10 NA NA 13
3 10 2 100 NA NA 130
4 1 2 12 10 NA 13
4 10 2 110 100 NA 120

But maybe we can create more features? Yes, we can and we should. We should create aggregation features.

Aggregation features

So far we have added lagged Sales as features. The example was simple and we had only 4 weeks of training data. But what if we have 200 weeks (4 years) of data? Do we want to add 200 features - sales with lags 1 to 200? Most likely not. The most recent week is a good predictor; the week before the most recent could be a good predictor as well; but the further into the past we go, the less predictive power there is. Still, some information is there. How do we use it? It is time to aggregate, to summarize. Instead of including all possible lagged weekly sales as separate features we will aggregate them. For example, we could include the mean sales of all weeks before the current one, like this:

Week StoreID ProductID Sales mean_Sales_lag_1_ProdStore
1 1 2 10 NA
2 1 2 12 10.00000
3 1 2 13 11.00000
4 1 2 13 11.66667
1 10 2 100 NA
2 10 2 110 100.00000
3 10 2 130 105.00000
4 10 2 120 113.33333

In that case we grouped by the tuple (StoreID, ProductID). But the mean sales of a store as a whole could have predictive power, and the mean sales of a product across all stores could be predictive too, so we could group just by StoreID and add a feature like the mean sales of a store over all weeks before the current one. We can do the same for products.

Week StoreID ProductID Sales mean_Sales_lag_1_ProdStore mean_Sales_lag_1_Prod mean_Sales_lag_1_Store
1 1 2 10 NA NA NA
2 1 2 12 10.00000 55.0 10.00000
3 1 2 13 11.00000 58.0 11.00000
4 1 2 13 11.66667 62.5 11.66667
1 10 2 100 NA NA NA
2 10 2 110 100.00000 55.0 100.00000
3 10 2 130 105.00000 58.0 105.00000
4 10 2 120 113.33333 62.5 113.33333
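Continuing the pandas sketch above (the DataFrame df with Week, StoreID, ProductID, Sales), aggregation features like the ones in this table can be computed as follows. The helper name past_mean is my own; the essential detail is that the current week is always excluded from the statistics.

```python
import numpy as np
import pandas as pd

def past_mean(df, keys, name):
    """Mean of Sales over all strictly earlier weeks, grouped by `keys`."""
    weekly = (df.groupby(keys + ["Week"])["Sales"]
                .agg(total="sum", n="count")
                .reset_index()
                .sort_values("Week"))
    grouped = weekly.groupby(keys)
    prior_total = grouped["total"].cumsum() - weekly["total"]   # sums of earlier weeks only
    prior_n     = grouped["n"].cumsum()     - weekly["n"]       # counts of earlier weeks only
    weekly[name] = np.where(prior_n > 0, prior_total / prior_n, np.nan)
    return df.merge(weekly[keys + ["Week", name]], on=keys + ["Week"], how="left")

df = past_mean(df, ["StoreID", "ProductID"], "mean_Sales_lag_1_ProdStore")
df = past_mean(df, ["ProductID"],            "mean_Sales_lag_1_Prod")
df = past_mean(df, ["StoreID"],              "mean_Sales_lag_1_Store")
```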

Again, the example is very simple - just 4 weeks. What if we have 150 or 500 weeks of data - should we summarize over all of them? Maybe two years ago something changed dramatically and now the distribution is very different? Or maybe there is some seasonality in the sales? So the question is how far into the past we would like to look when creating the aggregation features.

There is no universal recipe. But it is important to realize that the size of the window into the past becomes one of the parameters of the system. And when we have a lot of data we can compute features for several different windows: the most recent tendency - for example 6 weeks; 26 weeks (half a year) as another window; and all the historical data could be used as well. We could use decaying weights when computing the averages - and all these options become hyper-parameters of our model. We could also compute not just means but medians, quartiles, higher-order moments. And if we have features like ProductCategory or StoreType we could aggregate on them as well, and on their combinations with other categorical features.
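As a small pandas sketch of that idea (again on the toy df, and assuming one row per week per (StoreID, ProductID) with no gaps - otherwise a merge-based variant like the earlier one is needed), each window size simply becomes another feature and another hyper-parameter:

```python
df = df.sort_values(["StoreID", "ProductID", "Week"])
for window in (6, 26, 52):     # candidate window sizes - hyper-parameters to tune
    # mean over the last `window` weeks, shifted by one so the current week is excluded
    df[f"mean_Sales_last_{window}w"] = (
        df.groupby(["StoreID", "ProductID"])["Sales"]
          .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
    )
```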

The number of engineered features can explode rapidly - very rapidly indeed - and in this search for more and more information we can start overfitting very quickly.

We need a solid cross-validation strategy to control overfitting. This post is about cross-validation, but the introduction to time-series feature engineering took too much space - so I have split the post into two parts.

And in the second part I’ll be looking at cross-validation using the example of the Grupo Bimbo Inventory Demand competition.