Friday, February 24, 2017

Data augmentation of convolutional feature maps


Image augmentation is a powerful technique for improving the performance of an image classification system. With image augmentation you can extract more information from a dataset, which is especially important when your dataset is not big enough. Some could argue that you never have too much data: even with a dataset as big as ImageNet, the winners of the ImageNet Large Scale Visual Recognition Challenge augment data during both the training and prediction phases. One could conclude that you should always apply image augmentation.
Yes, you should - when it is feasible. Image augmentation has its cost, and the cost could be high.

Transfer learning is another powerful technique, one that achieves strong classification performance even with small datasets. Even when you have a larger dataset, using pre-trained networks is considered the default first step and will save you a lot of computation time.
It took weeks to train the VGG network, and now we can get the pre-trained model for free, use its weights as a starting point and fine-tune them, or, very commonly, just use the output of the last convolutional layer.
It has been shown that convolutional feature maps of models pre-trained on ImageNet - VGG, AlexNet, or whatever you choose from the Caffe model zoo - can be used for a range of applications, even for images which are not similar at all to the ImageNet dataset.
I should also point out that even though most of the pre-trained models were trained on smaller resolutions (like 224x224), you can apply them to high-resolution images if you replace the fully-connected top layers with convolutional layers.
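
In modern Keras the simplest equivalent is to drop the dense top entirely rather than converting it to convolutions; a minimal tf.keras sketch (the 448x448 input shape is just my illustration):

    from tensorflow.keras.applications import VGG16

    # The convolutional part of VGG accepts (almost) any resolution; only the
    # fully-connected top ties it to 224x224. Dropping the top (or rewriting it
    # as 7x7 and 1x1 convolutions) lifts that restriction.
    conv_base = VGG16(weights="imagenet", include_top=False,
                      input_shape=(448, 448, 3))
    print(conv_base.output_shape)   # (None, 14, 14, 512) instead of (None, 7, 7, 512)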

We have two powerful techniques - image augmentation and transfer learning - could we use them together?
In theory - yes.
But both techniques have an associated cost, which you pay in time and resources, and the combined cost of using both could be prohibitive.
This is especially true when you need to use high-resolution images without down-sampling - for example, when you need to recognize objects which are not dominant in the image, or when you perform fine-grained classification.

For high-resolution images, just applying a pre-trained network takes a lot of time.
If, for every image of your train set and for every epoch, you pass the image through all the layers of the pre-trained network and then through your own layers, training can be very slow.
A common technique to speed things up is to apply the pre-trained network to your images once and save the output of the last convolutional layer. The saved feature maps then become the input to your model.
Then you train the network which is specific to your task.
It is much, much faster - two orders of magnitude faster.
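
A minimal sketch of this caching step with today's tf.keras (the array and file names are hypothetical):

    import numpy as np
    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.applications.vgg16 import preprocess_input

    conv_base = VGG16(weights="imagenet", include_top=False)

    def cache_features(images, labels, out_prefix, batch_size=32):
        """Run the frozen VGG base once and save its feature maps to disk."""
        feats = conv_base.predict(preprocess_input(images.astype("float32")),
                                  batch_size=batch_size)
        np.save(out_prefix + "_features.npy", feats)   # e.g. (N, 7, 7, 512) for 224x224 inputs
        np.save(out_prefix + "_labels.npy", labels)

    # cache_features(train_images, train_labels, "train")   # hypothetical arrays
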
But now you cannot augment the original images anymore. You work with feature maps, not images.

Could we somehow use the fact that the space of the feature maps and the space of the original image are related? The top-left corner of a feature map roughly corresponds to the top-left area of the original image. For example, in the case of VGG the top-left "pixel" of the last convolutional layer's feature map corresponds to the top-left 16x16 rectangle of the image.

A common transformation applied at the image level is to take a random crop, and we can crop feature maps too. The resolution of a feature map is much lower, but a 10-20% crop at the image level would correspond to a crop of 2-3 feature map "pixels".
Random horizontal flip is another common transformation, and we can flip feature maps. Technically such transformations of feature maps are possible and easy to implement.
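
On the cached feature maps both transformations are a few lines of NumPy; a sketch (cropping one feature-map "pixel" in each dimension is my arbitrary choice):

    import numpy as np

    def augment_feature_map(fmap, crop=1, rng=np.random):
        """Random crop and horizontal flip of a single feature map of shape (H, W, C)."""
        h, w, _ = fmap.shape
        top = rng.randint(0, crop + 1)    # random offset of the (H-crop, W-crop) window
        left = rng.randint(0, crop + 1)
        fmap = fmap[top:h - crop + top, left:w - crop + left, :]
        if rng.rand() < 0.5:              # horizontal flip = reverse the width axis
            fmap = fmap[:, ::-1, :]
        return fmap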

But the real question is - are they useful? Could we replace image-level transformations with feature map transformations and achieve a similar effect?

There is a very interesting paper discussing spatial transformations of feature maps in detail.

However, the goal of this post is not theory. I would just like to apply image augmentation and feature map transformations to a dataset and compare performance and running time.

I will train 3 models on very small subsets of Kaggle's Dogs vs. Cats dataset.
And I mean very small - we will start with a train set of just 10 samples (5 dogs and 5 cats), then 20, 40, and 80 samples. I will use pre-trained VGG to produce convolutional feature maps and train a small fully convolutional network on top of them.

You can look at the Augmenting Feature Maps notebook for the details, the code, and the architecture of the model.

Here we go straight to the results:

n_samples  model                      accuracy  logloss  sec/epoch
10         baseline                   0.735     0.476    0
10         feature maps augmentation  0.853     0.363    0
10         image augmentation         0.806     0.395    11
20         baseline                   0.908     0.296    0
20         feature maps augmentation  0.915     0.264    0
20         image augmentation         0.932     0.257    11
40         baseline                   0.929     0.236    0
40         feature maps augmentation  0.933     0.221    0
40         image augmentation         0.932     0.216    12
80         baseline                   0.941     0.199    0
80         feature maps augmentation  0.941     0.206    0
80         image augmentation         0.934     0.200    14


What is amazing is the power of pre-trained models: with VGG features and data augmentation, a very simple model achieves reasonable performance trained on just 10 samples, and very good performance on just 40 samples.
We also see that data augmentation helps a lot with the 10-sample subset, helps with the 20-sample subset, and then we observe diminishing returns - with 40 samples the baseline model is catching up, and with 80 the performance is practically the same.
We also see that training with image augmentation is much slower, while there is no noticeable slowdown with feature map augmentation.

Still, it is important to point out that there are more augmentations which you can apply at the image level: you can augment color and lighting conditions, or perform more elaborate spatial transformations. Such augmentations cannot be done at the feature map level.

The obvious benefit of feature map transformations is that you can boost performance at very little cost. Such transformations could also be implemented as a custom layer of a network.
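
As an illustration, with today's tf.keras preprocessing layers (which did not exist when this was written) the crop and flip can live inside the model; the head below is a hypothetical one, not the architecture from the notebook:

    import tensorflow as tf

    # Input: cached VGG feature maps, e.g. 7x7x512 for 224x224 images.
    inputs = tf.keras.Input(shape=(7, 7, 512))
    x = tf.keras.layers.RandomFlip("horizontal")(inputs)   # no-op at inference time
    x = tf.keras.layers.RandomCrop(6, 6)(x)                # random crop in training, center crop otherwise
    x = tf.keras.layers.Conv2D(128, 3, activation="relu")(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])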

It would be interesting to try feature map transformations on a more challenging dataset, which I will do soon and share the results here.

P. S.
For those who would like to learn more about data augmentation and transfer learning, I would recommend the following papers:

Spatial Transformer Networks
How transferable are features in deep neural networks?
Some improvements on deep convolutional neural network based image classification.



Tuesday, January 3, 2017

Cross-validation in the presence of outliers

In November 2016 the Kaggle Allstate Claims Severity competition finished. Unusually for a recruiting competition, it turned out to be big - more than 3,000 participants, top Kaggle guns, and a very close battle till the very end.
But it is not about the number of participants - it is about what you learn by participating. I learned a lot during the competition, and this post is about the first lesson.

Cross-validation in the presence of outliers

First, a few words about the competition itself. The organizers provided a dataset of 130 fully anonymized features and about 200,000 observations. It was a regression problem - the target was a positive number called loss, and the metric was MAE (mean absolute error).
A typical ensemble battle, one would predict. And indeed it was won by an ensemble in the end, but there were still many twists and interesting observations on the way to the finish line.

Let’s look at the distribution of the target variable.

The distribution is very skewed, with a high proportion of outliers, many of them extreme.
I am using Tukey's test here to detect outliers. According to it, an observation of a variable \(y\) is considered an outlier if \(y > Q3 + c*IQR\) or \(y < Q1 - c*IQR\), where \(Q1\) is the first quartile, \(Q3\) is the third quartile, and \(IQR\) is the interquartile range \((Q3 - Q1)\). \(c = 1.5\) indicates an outlier, and with \(c >= 3\) an outlier is considered extreme.

The boxplot above marks outliers (for \(c = 3\)) as red dots. We see that all the outliers are on the right side; for this particular competition the target had a lower bound of zero.
From here on I will be looking only at the outliers which are above \(Q3 + c*IQR\).
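
In code, the one-sided version of the test is just a few lines (a sketch; the loss column name is hypothetical):

    import numpy as np

    def upper_outliers(y, c=3.0):
        """Boolean mask of observations above Q3 + c*IQR (one-sided Tukey test)."""
        q1, q3 = np.percentile(y, [25, 75])
        return y > q3 + c * (q3 - q1)

    # outliers = upper_outliers(train["loss"].values, c=3)
    # print(outliers.sum(), outliers.mean())   # n_out and frac_out, as in the table below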

c    n_out  frac_out
1.5  11554  0.061
2    7638   0.041
3    3431   0.018
4    1632   0.009
5    823    0.004
6    459    0.002

The table above shows the number of outliers (n_out) and their proportion (frac_out) in the train dataset for different values of \(c\).

There are several different strategies for dealing with outliers when you train a model, but in this post I look only at the question of how to do cross-validation for regression tasks in the presence of outliers.

Stratify or not?

The main question I'm trying to answer: should validation sets be stratified by outliers?

First, when is stratification usually applied?
For classification tasks, stratification is performed to ensure that the proportions of the target classes in the validation folds are the same as in the whole population. It is especially important when the classes are not well balanced.

Following the same logic, one wants to preserve the proportion of outliers in the train and validation folds. Still, it is not obvious that this is important - we are not predicting whether an observation is an outlier, we are doing regression.

The goal of cross-validation is to test and tune a model, and we want the improvements found with cross-validation to generalize to the real test set. A common approach to tuning a model is to use k-fold cross-validation and run grid or random search to find the best set of hyperparameters. For each run we get a value of the loss function for each of the folds, we compute the mean of the loss across the folds, and we use this value to select the best set of parameters. But the loss varies between folds, and we should be sure that the improvements we observe are significant - that they are due to changes to the model, not just to the noise introduced when we created the folds.
It is very desirable to reduce the variance of the loss induced just by the way we split the train set into folds, and if stratification reduces this variance, we should use it.

To find out whether stratification really reduces the variance, let's run a statistical test. The hypothesis is that creating k-folds stratified by outliers will increase the homogeneity of the folds.
How do we measure the homogeneity?
We will compute the difference between a validation fold's mean and the mean of the corresponding train folds, and use the standard deviation of these differences across the folds as the test statistic.
For example, for a 5-fold validation:
The general population is the set of all possible 5-fold splits of the train set. One unit of this population is one split into 5 folds.
As the test statistic we measure the following number: for each validation fold (and we have 5 of them) we compute the difference between the mean of the target in that fold and in the 4 remaining folds (the train folds). We get 5 numbers, and the test statistic is the standard deviation of these 5 numbers (10 in the case of 10-fold splits).
I have chosen this statistic because we are interested not in the differences themselves but in how much they vary between folds - the more the variation, the more noise was added just by the splitting.
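
A sketch of how one draw of this statistic could be computed with scikit-learn splitters (the helper name is mine):

    import numpy as np
    from sklearn.model_selection import KFold, StratifiedKFold

    def fold_mean_spread(y, splitter, strat=None):
        """Std across folds of (validation-fold mean minus train-folds mean of the target)."""
        y = np.asarray(y)
        X = np.zeros((len(y), 1))                 # dummy features - only the indexes matter
        diffs = [y[val].mean() - y[trn].mean()
                 for trn, val in splitter.split(X, strat)]
        return np.std(diffs)

    # One draw for a plain and for a stratified 5-fold split
    # (`outliers` is the boolean mask from the Tukey test above):
    # fold_mean_spread(y, KFold(5, shuffle=True))
    # fold_mean_spread(y, StratifiedKFold(5, shuffle=True), strat=outliers)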

The formal hypothesis statement is:
The null hypothesis (\(H_0\)) is that stratification has no effect on the test statistic defined above, and the alternative hypothesis (\(H_1\)) is that stratification reduces it.

The general population is all possible splits of the train set into k folds (for \(k\) in {5, 10}).
The affected population is the set of k-fold splits stratified by outliers.

We are going to perform the test for the 5-fold and 10-fold settings, and for different values of \(c\) in the outlier definition, with \(c\) in {1.5, 2, 3, 4, 5}.

The level of significance (\(\alpha\)) - 0.05.

First we need to estimate the mean of the test statistic for the general population. To do that, I generated 1000 k-fold splits and computed the mean of the test statistic.
For the stratified folds, I generated 200 splits for each combination of \(k\) and \(c\).

The results are in the table below:

k(n_folds)  c    expected  observed  diff     se      p_val
5           1.5  8.048     5.379     -2.668   0.1609  4.458e-62
5           2    8.048     5.904     -2.144   0.1788  2.048e-33
5           3    8.048     6.343     -1.705   0.1971  2.635e-18
5           4    8.048     7.059     -0.9885  0.2246  5.383e-06
5           5    8.048     7.043     -1.005   0.2083  7.057e-07
10          1.5  12.3      8.327     -3.974   0.1721  2.961e-118
10          2    12.3      8.501     -3.8     0.1747  3.128e-105
10          3    12.3      9.947     -2.353   0.1983  8.569e-33
10          4    12.3      10.61     -1.687   0.2068  1.729e-16
10          5    12.3      10.92     -1.385   0.205   7.105e-12

The results are significant.
We can conclude that stratification by outliers makes validation folds more similar to the train folds - which is desirable.

Still, what we are really interested in is how the splitting affects model tuning. Could we measure this effect directly?
We could. Let's do a similar test, but this time the test statistic will be the standard deviation of a model's performance across the folds.
We will do that for two learners. The first is standard least-squares regression (as implemented by the LinearRegression class of scikit-learn), and the second is the more outlier-robust HuberRegressor. The loss function is MAE - mean absolute error.
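
A sketch of one draw of this model-based statistic (hypothetical X, y arrays; the helper name is mine):

    import numpy as np
    from sklearn.linear_model import HuberRegressor, LinearRegression
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import KFold, StratifiedKFold

    def fold_mae_spread(X, y, splitter, learner, strat=None):
        """Std across folds of the validation MAE of a freshly fitted learner."""
        maes = []
        for trn, val in splitter.split(X, strat):
            model = learner().fit(X[trn], y[trn])
            maes.append(mean_absolute_error(y[val], model.predict(X[val])))
        return np.std(maes)

    # fold_mae_spread(X, y, KFold(5, shuffle=True), LinearRegression)
    # fold_mae_spread(X, y, StratifiedKFold(5, shuffle=True), HuberRegressor, strat=outliers)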

The null hypothesis is the same - stratification does not affect the variance of the loss across validation folds.
The alternative hypothesis is that stratification reduces it.

We will run the test for 5-fold splits and for different values of \(c\) in {1.5, 2, 3, 4, 5} in the outlier definition.

The general population and the affected populations are the same; the only thing that changed is the test statistic.

To estimate the mean of the test statistic for the general population (all possible k-fold splits), I trained each learner on 200 random k-fold splits (k*200 fits in total) and computed the standard deviation of the loss across the k folds of each split. This gave two samples of size 200, one per learner, and their means are good estimates of the population means.

For the affected populations we have a sample of size 50 for each combination of learner and \(c\). We do that only for 5-fold splits.

learner           k(n_folds)  c    expected  observed  diff     se      p_val
LinearRegression  5           1.5  6.602     5.941     -0.6612  0.2646  0.00623
LinearRegression  5           2    6.602     5.935     -0.667   0.3085  0.01532
LinearRegression  5           3    6.602     5.084     -1.518   0.2595  2.434e-09
LinearRegression  5           4    6.602     5.246     -1.357   0.2675  1.98e-07
LinearRegression  5           5    6.602     5.465     -1.138   0.2421  1.3e-06
HuberRegressor    5           1.5  9.526     7.94      -1.586   0.3521  3.335e-06
HuberRegressor    5           2    9.526     7.732     -1.794   0.4157  7.96e-06
HuberRegressor    5           3    9.526     7.502     -2.025   0.4148  5.294e-07
HuberRegressor    5           4    9.526     8.22      -1.306   0.4968  0.004285
HuberRegressor    5           5    9.526     9.121     -0.4056  0.4908  0.2043

Interestingly, we see that the effect is more pronounced for bigger \(c\). With \(c\) in {3, 4, 5} we have fewer outliers, but they are extreme, and the chance that a fold will be imbalanced is higher. When we tested for homogeneity, the effect was bigger for smaller \(c\).

The conclusion is that when cross-validation is performed for hyperparameter tuning or feature selection, it is beneficial to stratify folds by extreme outliers (for \(c >= 3\)).

But should we apply this advice when we create folds to make out-of-fold predictions to be used at the ensemble level?
It is an interesting question, but before answering it I'd like to look at the somewhat related topic of the effect of regularization on ensembles, which is discussed in lesson two.

The final note of this part is on a technical question:

How to create stratified folds in R and Python

The caret package is very popular among data scientists who use R. Caret provides tools to tune model parameters and to cross-validate models, and it includes a number of functions to split data into train and validation folds. But when it comes to k-fold splits (e.g. with the createFolds function), caret always tries to balance the folds it creates (it does this by treating percentiles of the target as groups and stratifying on them) - so caret does more than just stratify by outliers.
But what if such balancing is unwanted? You should be aware that caret does not have an option to switch it off, so you will need other ways to create folds.

For Python users, scikit-learn provides the StratifiedKFold class, which is intended for classification tasks, as it stratifies by the unique values of the target. That is not what we need for regression, but it is not difficult to make StratifiedKFold stratify by outliers: compute a boolean vector which is True only for the outliers of the target, pass this vector as the y argument to StratifiedKFold.split(), and you will get stratified fold indexes.
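
A sketch of that recipe (assuming y is a NumPy array holding the regression target):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    q1, q3 = np.percentile(y, [25, 75])
    is_outlier = y > q3 + 3 * (q3 - q1)          # extreme outliers, c = 3

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for fold, (trn_idx, val_idx) in enumerate(skf.split(np.zeros(len(y)), is_outlier)):
        print(fold, is_outlier[val_idx].mean())  # the outlier share is preserved in every fold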

Sunday, October 30, 2016

Cross-validation with time-series datasets - part two - Bimbo

Recently Kaggle launched two new competitions with time-series data - Santander Product Recommendation and Outbrain Click Prediction. And as in so many time-series competitions, the choice of cross-validation strategy will be a decisive factor. It is the right time to look at cross-validation for time-series data using the example of a past competition. In the first part of this post we looked at the general approach to a time-series competition and at how to create features using time dependencies. The conclusion was that one can create a lot of features - so many that it becomes really easy to overfit - so a solid cross-validation strategy is extremely important. In this part I will use a recent Kaggle competition, Grupo Bimbo Inventory Demand, to look at possible approaches to cross-validation with time-series datasets.

Grupo Bimbo supplies over 1 million stores in Mexico with bakery and related products. In the competition, Bimbo asked participants to predict how much product they should supply to each of their clients (stores). And they gave a lot of data - the train set contained 75 million rows. We are given 7 weeks of Bimbo's sales to their clients, plus other data (such as returns of unsold products). The full description of the competition and its datasets is on the competition web page, but I'd like to concentrate on cross-validation and omit less relevant details. So for the purpose of this blog I've made the following simplifications to the data:

  • the columns were renamed to English equivalents
  • the target is Sales in pesos
    • features like sales in units and returns are ignored
  • the number of categorical features was reduced
  • I do not distinguish between a client and a store
  • the week numbering starts from 1

So the head of the modified train set looks like this:

Week  Depo  Product  Client  Sales
1     1110  1453543  1242    45.84
1     1110  9503385  5380    14.82
1     1110  1159580  30571   68.75
1     1110  198780   42434   45.8
1     1110  9503385  37057   15
1     1110  18641    31719   30.36
1     1110  1457784  30571   43.75
1     1110  817243   1146    21.39
1     1110  181232   1309    13.52

This is just the top of the train set, and in this competition we were given a dataset of 75 million rows. On the one hand that looks like a lot, but on the other hand - considering that there are almost 900,000 clients and 2,600 products - it is not really that much. For example, when it comes to (Product, Client) tuples the data are sparse: there are more than a million such tuples which appear in the test set but are not present in the train set at all. Both the size of the data and this sparsity are important when choosing a cross-validation strategy.
But the most important factor when deciding on a cross-validation approach in any competition is how the test set is split into the public and private subsets.

For those who have never taken part in a Kaggle competition, a few words about what the private/public split means. The final standing of a Kaggle competition is decided by how well the participants' models perform on the test set - for which, obviously, the competitors do not know the target variable. During a competition, the preliminary ranking of the participants can be seen on the public leaderboard. This leaderboard shows how submitted solutions scored on a portion of the test set, called the public test set. But the final results of the competition are decided by the score on the private portion of the full test set. The difference between the preliminary standing on the public leaderboard and the final standing on the private leaderboard can be significant, very significant - participants can lose hundreds of places. For example, only three competitors who were in the top ten on the public LB made it to the final top ten of the Rossmann Store Sales competition, which was also a time-series one.

The usual problem is that participants overfit to the public test set.
Knowing how the test set was split helps to avoid such overfitting. In many competitions the organizers split the test set randomly - say, a 30% public portion and a 70% private portion. But for time-series competitions a more natural way of splitting the test set is by time.

And the test set of Bimbo competition was split by time.
In this competition we were given seven weeks of data to train a model and we were asked to predict the sales for the following two weeks.
The competition organizers clearly stated that the test set was split roughly 50%/50% private/public and that the split was by time. This means that one of the test set weeks is used for the public leaderboard and the other for the private leaderboard. The organizers did not say which week is private and which is public, but one can spend just two submissions to easily figure that out. In the case of Bimbo, the first week of the test set was public and the second week was private.

We want our cross-validation score to tell us how our model will perform on the private leaderboard, and that means the cross-validation arrangements should reflect how the final evaluation of the competition is done. But with time-series data, not only cross-validation but the entire approach to the competition depends on how the test set is split.

Why?

Because the final result depends only on how well the model performs on the second week of the test set; for the final standing it is not important how well you predicted the first week. It may be easier to predict the first week of the test set, as it is closer to the train set, but the final score is decided by the second week. And as I illustrated in the first part, we will have fewer features for the second week of the test set because we simply do not have the Sales of the previous week - we can only use Sales with a lag of 2 or more to engineer features. This might mean that we should have two different models: one for the first week of the test set and another for the second week. But once we know that only the second week matters, we do not actually need to build a model for the first week. We should understand, though, that some participants will do exactly that, and they will look good on the public leaderboard.

So we have made a decision to build a model that predicts only the second week of the test set. It means that for feature engineering we can only use Sales with a lag of two or more; we cannot use the immediately preceding week, because we do not have sales for the week immediately before the second week of the test set.

And we should make sure that our validation set is not contaminated with data which the second week of the test set cannot have. Such contamination can be subtle. For example, some common approaches to encoding categorical variables can leak information into validation sets - and the CV score would then be overoptimistic.

Ideally, we would like our cross-validation arrangements to emulate the private/public test split as closely as possible. In the case of Bimbo it means that we should hold out the last two weeks of the train set as the validation set, train the model on the remaining data, and when predicting the holdout set we would get an honest, if slightly pessimistic, estimate of the public and private scores. The first week of the validation set is an analogue of the public portion of the test set, and the second week corresponds to the private part. It is even possible to estimate the difference between the performance of the model on the public and private leaderboards.

But we can rarely do the ideal thing. The question is how pessimistic that "slightly" really is, and with this ideal approach it could be too much.
We have 7 weeks of history to train our models. If we hold out the last two weeks as a validation set, just 5 weeks of data are left for training - not that much. Moreover, as we are interested in predicting the second week of the test set, we will use engineered features with a lag of 2 or more, and the first two weeks of the train set do not have such features, so we have reduced our train set significantly. And when it comes to aggregation features, they are not reliable when computed on just 5 weeks or even fewer.

And in this particular competition it was even more significant: when we look at a (Product, Client) tuple, we see that in the seven weeks of the train data a particular product may have been delivered to a store in only two, three, or often just one of the seven weeks. So when we withhold two weeks of the train data, we lose a lot of information.

As a result, the CV score could become not just pessimistic but unreliable, because the model would be too weak and very different from the one trained on the whole train set.

But do we really need to hold out two weeks for the validation set? We don't. As we are really interested in predicting the second week of the test set, we can hold out just one week for validation - it will mimic the second week of the test set. We keep week 7 (the last week of the train set) as the validation set, and when we compute features for this week we do not use the Sales of the previous week (week 6), because the second week of the test set doesn't have this information.
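
A sketch of how such a lag feature can be built on the simplified frame above (pandas; a self-merge on week minus lag handles the fact that a (Client, Product) pair does not appear every week):

    import pandas as pd

    def add_sales_lag(df, lag=2):
        """Attach Sales from `lag` weeks earlier for the same (Client, Product) pair."""
        past = df[["Week", "Client", "Product", "Sales"]].copy()
        past["Week"] += lag                                    # align week w-lag with week w
        past = past.rename(columns={"Sales": f"Sales_lag_{lag}"})
        # aggregate first, in case a pair has several rows within one week
        past = past.groupby(["Week", "Client", "Product"], as_index=False).mean()
        return df.merge(past, on=["Week", "Client", "Product"], how="left")

    # train = add_sales_lag(train, lag=2)   # NaN where the pair was not seen `lag` weeks earlier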

So far we have made three important decisions:

  • we build a model to predict only the second week of the test set
  • we hold out the last week of the train set as the validation set, which mimics the second week of the test set
  • we engineer features using a lag of 2 or more

One more decision has to be made: which weeks of the train set will we use to train the models?

First, look at the table below:

Week  Product  Client  mean_Sales  mean_Sales_Client  mean_Sales_Product  Sales_lag_2
3     357      43507   NA          22.41              26.52               NA
3     30415    43507   NA          22.41              25.73               NA
3     35144    43507   20.5        22.41              56.04               20.5
3     30532    43507   38.9        22.41              48.22               38.9
3     36748    43507   NA          22.41              30.31               NA
3     43067    43507   NA          22.41              19.13               NA
3     43069    43507   7.41        22.41              34.46               7.41
3     43285    43507   NA          22.41              70.25               NA
4     357      43507   29.16       24.7               27.04               29.16
4     30415    43507   7.5         24.7               25.96               7.5
4     35141    43507   41          24.7               48.96               NA
4     35144    43507   20.5        24.7               55.77               20.5
4     30532    43507   42.79       24.7               50.29               46.68
4     36748    43507   NA          24.7               30                  NA
4     43067    43507   NA          24.7               18.75               NA
4     43069    43507   14.82       24.7               34.62               22.23
4     43285    43507   63.36       24.7               68.89               63.36
5     357      43507   29.16       27.57              27.28               29.16
5     30415    43507   11.25       27.57              25.67               15
5     35141    43507   41          27.57              49.05               NA
5     35144    43507   20.5        27.57              55.01               20.5
5     30532    43507   51.87       27.57              49.96               70.02
5     36748    43507   11.53       27.57              29.79               11.53
5     43067    43507   19.26       27.57              18.63               19.26
5     43069    43507   12.35       27.57              33.81               7.41
5     43285    43507   84.48       27.57              68.73               105.6
6     357      43507   29.16       26.33              27.12               29.16
6     35141    43507   30.75       26.33              49                  20.5
6     35144    43507   20.5        26.33              54.2                20.5
6     30532    43507   46.68       26.33              49.85               31.12
6     43285    43507   73.92       26.33              68.63               52.8

The table illustrates some points about the choices discussed above. I do not include weeks 1 and 2, as all the features computed with a lag of two weeks are empty for them. We see that for week 3 a lot of features are empty as well, and all aggregation features for week 3 were computed from just one week - so they are not really aggregations. In this particular competition, and usually with time-series data, central tendency features like the mean or median are the most powerful, but they need enough data points to become reliable. It means that weeks 3 and 4 simply do not have robust features.

So we are left with weeks 5, 6 and 7 for the final models, and with weeks 5 and 6 for the cross-validation models.
We can train a model on weeks 5, 6, 7, or on 6, 7, or just on 7. We can do all those combinations and then use the predictions at the ensemble level.
Still, we cannot reliably compare the performance of a model trained on weeks 5, 6, 7 with a model trained on weeks 6, 7 using cross-validation alone.

But in this case the feedback from the public leaderboard can be very helpful.
I would like to elaborate on that.
I already mentioned that the public LB is based on the first week of the test set, and this week is easier to predict than the second, as it is closer to the train set and we have more information to train a model. But there is not much sense in looking good on the public LB only to fail miserably on the private LB.
What we could do instead is to train a separate model for the first week of the test set which does not use the last week of the train set at all. Yes, we throw away a whole week of the train set, but we treat the first week of the test set as if it were the second. The score on the public leaderboard then becomes a much better estimate of the private score, and we can also evaluate how well our CV procedure works and adjust it if we see problems.

Summing everything up, we have the following cross-validation procedure for the Bimbo competition (a code sketch follows the list):

  • We keep the last week of the train set (week 7) as the validation set.
  • We compute features for the validation set and for the training set with a lag of 2 or more (the same condition as for the private part of the test set).
  • We train the validation model on week 6 (with features based on weeks 1-4), make predictions and compute the score for week 7, and use the same model to predict the first week of the test set (week 8).
  • For the second week of the test set (week 9) we build a new model, this time using all the data of the train set, but still with all lagged features computed with a lag of 2 or more. We train this model on the last week of the train set.
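
In code, the week arithmetic of this procedure boils down to a few masks over the simplified frame (a sketch; column and variable names are mine):

    VAL_WEEK, PUBLIC_WEEK, PRIVATE_WEEK = 7, 8, 9

    # Validation / public-LB model: features use lag >= 2, so week 7 never sees week 6 sales.
    cv_train = train[train.Week == VAL_WEEK - 1]     # week 6
    cv_valid = train[train.Week == VAL_WEEK]         # week 7 - mimics the private test week
    # ... fit on cv_train, score on cv_valid, predict test rows with Week == PUBLIC_WEEK

    # Final model for the private-LB week: trained on the last week of the train set.
    final_train = train[train.Week == VAL_WEEK]      # week 7
    # ... fit on final_train, predict test rows with Week == PRIVATE_WEEK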

With this approach we have reliable cross-validation, and the public LB score becomes a good proxy for the private LB score.

Such an approach is conservative, and the CV score will be a little pessimistic - there may be reasons to be more aggressive. But it is not the absolute cross-validation score of a model that matters; what matters is that the CV score moves in the same direction as the test score - and with this approach, using the public LB as additional validation, one is much less likely to overfit.

It has been a long post, but having participated in several competitions with time-series data, I have learned that it is much more effective to invest the time and run some experiments to get a solid cross-validation strategy at the very beginning of any time-series competition. Feature engineering, hyperparameter tuning, and stacking will work only if you can rely on your cross-validation score.

P.S.
The cross-validation approach described in this post works well, but mostly for single-level models. For any ensemble which uses something more complex than a weighted average, cross-validation becomes much more involved - and cross-validation strategy for ensembles is a huge topic in itself.