In November 2016 the Kaggle Allstate Claims Severity competition finished. Unusually for a recruiting competition, it turned out to be big: more than 3000 participants, top Kaggle guns, and a very close battle till the very end.
But it is not about the number of participants - it is about what you learn by participating. I learned a lot during the competition, and this post is about the first lesson.
Cross-validation in the presence of outliers
First, a few words about the competition itself. The organizers provided a dataset of 130 fully anonymized features and about 200,000 observations. It was a regression problem - the target was a positive number called `loss`. The metric was MAE - mean absolute error.
One would predict a typical ensemble battle. And indeed it was won by an ensemble in the end, but there were still many twists and interesting observations on the way to the finish line.
Let’s look at the distribution of the target variable.
The distribution is very skewed, with a high proportion of outliers, many of them extreme.
I am using Tukey's test here to detect outliers. According to it, an observation of variable y is considered an outlier if y > Q3 + c*IQR or y < Q1 - c*IQR, where Q1 is the first quartile, Q3 is the third quartile, and IQR is the interquartile range (Q3 - Q1). With c = 1.5 an observation is considered an outlier, and with c >= 3 an outlier is considered to be extreme.
The boxplot above marks outliers (for c = 3) as red dots. We see that all the outliers are on the right side - for this particular competition the target had a lower bound of zero.
From here on I will only be looking at the outliers which are above Q3 + c*IQR.
c | n_out | frac_out |
---|---|---|
1.5 | 11554 | 0.061 |
2 | 7638 | 0.041 |
3 | 3431 | 0.018 |
4 | 1632 | 0.009 |
5 | 823 | 0.004 |
6 | 459 | 0.002 |
The table above shows the number of outliers (`n_out`) and the proportion of outliers (`frac_out`) in the train dataset for different values of c.
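A minimal sketch of how these counts could be computed (the target here is a random stand-in - replace it with the real target, e.g. `train['loss']`):

```python
import numpy as np
import pandas as pd

def upper_tukey_outliers(y, c):
    """Boolean flag for observations above Q3 + c * IQR (upper fence only)."""
    q1, q3 = np.percentile(y, [25, 75])
    return y > q3 + c * (q3 - q1)

# stand-in for the real target; replace with e.g. train['loss']
loss = pd.Series(np.random.lognormal(mean=7.7, sigma=0.8, size=188318))

for c in [1.5, 2, 3, 4, 5, 6]:
    flag = upper_tukey_outliers(loss, c)
    print(c, int(flag.sum()), round(float(flag.mean()), 3))
```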
There are several different strategies for dealing with outliers when you train a model, but in this post I look only at the question of how to do cross-validation for regression tasks in the presence of outliers.
Stratify or not?
The main question I'm trying to answer: should validation sets be stratified by outliers?
First, when is stratification usually applied?
For classification tasks, stratification is performed to ensure that the proportions of the target classes in the validation folds are the same as in the whole population. It is especially important when the classes are not well balanced.
Following the same logic, one wants to preserve the proportion of outliers in the train and validation folds. But it is still not obvious that this is important - we are not predicting whether an observation is an outlier, we are doing regression.
The goal of cross-validation is to test and tune a model - and we want the improvements to the model found with cross-validation to generalize to the real test set. A common approach to tuning a model is to use k-fold cross-validation and run grid or random search to find the best set of hyperparameters. For each run we get the value of the loss function on each of the folds, we compute the mean of the loss across the folds, and we use this value to select the best set of parameters. But the loss for different folds varies, and we should be sure that the improvements we observe are significant - that they are due to changes to the model, not just to the noise introduced when we created the folds.
It is very desirable to reduce the variance of the loss induced just by the way we split the train set into folds, and if stratification reduces this variance we should use it.
To find out whether stratification really reduces the variance, let's run a statistical test. The hypothesis is that by creating k-fold splits stratified by outliers we will increase the homogeneity of the folds.
How do we measure the homogeneity?
We will compute the difference between a validation fold mean and the mean of the corresponding train folds, and will use the standard deviation of these differences across the folds as the test statistic.
For example, for 5-folds validation:
The general population is the set of all possible 5-fold splits of the train set. One unit of this population is one split into 5 folds.
As the test statistic we measure the following number: for each validation fold (and we have 5 of them) we compute the distance between the mean of the target in this fold and in the 4 remaining folds (the train folds). This gives 5 numbers, and the test statistic is the standard deviation of these 5 numbers (10 in the case of 10-fold splits).
I have chosen this statistic because we are interested not in the distances themselves but in how much they vary across folds - the more the variation, the more noise was added just by the splitting.
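A minimal sketch of this statistic for a single split (the target array below is a random skewed stand-in, not the competition data):

```python
import numpy as np
from sklearn.model_selection import KFold

def fold_spread(y, fold_indices):
    """Std, across folds, of (validation-fold mean) - (remaining train-folds mean)."""
    y = np.asarray(y)
    diffs = []
    for val_idx in fold_indices:
        val_mask = np.zeros(len(y), dtype=bool)
        val_mask[val_idx] = True
        diffs.append(y[val_mask].mean() - y[~val_mask].mean())
    return np.std(diffs)

y = np.random.lognormal(mean=7.7, sigma=0.8, size=20000)  # stand-in target
folds = [val_idx for _, val_idx in KFold(n_splits=5, shuffle=True).split(y)]
print(fold_spread(y, folds))
```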
The formal hypothesis statement is:
The null hypothesis (H0) is that stratification has no effect on the test statistic defined above, and the alternative hypothesis (H1) is that stratification reduces it.
The general population is all possible splits of the train set into k folds (for k in {5,10}).
The affected population is the set of k-folds stratified by outliers.
We are going to perform the test for the 5-fold and 10-fold settings, and for different values of c in the outlier definition, c in {1.5,2,3,4,5}.
The level of significance (α) - 0.05.
First we need to find the mean of the test statistic over the general population. To do that I generated 1000 k-fold splits and computed the mean of the test statistic.
To compute the test statistic for stratified folds I generated 200 splits for each combination of k and c.
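A sketch of how the two samples and the one-sided test could be produced, using `StratifiedKFold` on the outlier flag (the stratification trick is described at the end of the post); the target is again a random stand-in, and the statistic is the same as in the previous sketch:

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import KFold, StratifiedKFold

def fold_spread(y, fold_indices):
    # same test statistic as in the previous sketch
    diffs = []
    for val_idx in fold_indices:
        mask = np.zeros(len(y), dtype=bool)
        mask[val_idx] = True
        diffs.append(y[mask].mean() - y[~mask].mean())
    return np.std(diffs)

y = np.random.lognormal(mean=7.7, sigma=0.8, size=20000)   # stand-in target
q1, q3 = np.percentile(y, [25, 75])
out_flag = y > q3 + 3 * (q3 - q1)                          # outlier flag, c = 3

# mean of the statistic over plain k-fold splits (the general population)
plain = [fold_spread(y, [v for _, v in
                         KFold(5, shuffle=True, random_state=s).split(y)])
         for s in range(1000)]

# sample of the statistic over splits stratified by the outlier flag
strat = [fold_spread(y, [v for _, v in
                         StratifiedKFold(5, shuffle=True, random_state=s).split(y, out_flag)])
         for s in range(200)]

# one-sided z-test: H1 is that stratification reduces the statistic
se = np.std(strat, ddof=1) / np.sqrt(len(strat))
z = (np.mean(strat) - np.mean(plain)) / se
print(np.mean(plain), np.mean(strat), stats.norm.cdf(z))
```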
The results are in the table below:
k(n_folds) | c | expected | observed | diff | se | p_val |
---|---|---|---|---|---|---|
5 | 1.5 | 8.048 | 5.379 | -2.668 | 0.1609 | 4.458e-62 |
5 | 2 | 8.048 | 5.904 | -2.144 | 0.1788 | 2.048e-33 |
5 | 3 | 8.048 | 6.343 | -1.705 | 0.1971 | 2.635e-18 |
5 | 4 | 8.048 | 7.059 | -0.9885 | 0.2246 | 5.383e-06 |
5 | 5 | 8.048 | 7.043 | -1.005 | 0.2083 | 7.057e-07 |
10 | 1.5 | 12.3 | 8.327 | -3.974 | 0.1721 | 2.961e-118 |
10 | 2 | 12.3 | 8.501 | -3.8 | 0.1747 | 3.128e-105 |
10 | 3 | 12.3 | 9.947 | -2.353 | 0.1983 | 8.569e-33 |
10 | 4 | 12.3 | 10.61 | -1.687 | 0.2068 | 1.729e-16 |
10 | 5 | 12.3 | 10.92 | -1.385 | 0.205 | 7.105e-12 |
The results are significant.
We can conclude that stratification by outliers makes validation folds more similar to the train folds - which is desirable.
Still, we are really interested in the question of how splitting affects model tuning. Could we measure this effect directly?
We could. Let's do a similar test, but this time the test statistic will be the standard deviation of the performance of a model across the folds.
We will do that for two learners. The first is standard least-squares regression (as implemented by the `LinearRegression` class of scikit-learn) and the second is the more outlier-robust `HuberRegressor`. The loss function is MAE - mean absolute error.
The null hypothesis is the same - stratification does not affect the variance of the loss across validation folds.
The alternative hypothesis - stratification reduces it.
We will run the test for 5-fold splits and for different values of c in {1.5,2,3,4,5} in the outlier definition.
The general population is the same, and the affected populations are the same. The only thing that changed is the test statistic.
To compute the mean of the test statistic for the general population (all possible k-fold splits) I trained each learner on 200 k-fold splits (k*200 runs in total) and computed the standard deviation of the loss across the k folds of each split. This gave two samples of size 200 - one per learner - whose means are good estimates of the population means.
To compute the test statistic for the affected populations we have a sample of size 50 for each combination of learner and c. We do this only for 5-fold splits.
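A sketch of how this statistic could be computed for one split and one learner (the features and target below are synthetic stand-ins for the anonymized competition data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

def mae_spread(learner, X, y, fold_indices):
    """Std, across folds, of the validation MAE of the learner."""
    maes = []
    for val_idx in fold_indices:
        val_mask = np.zeros(len(y), dtype=bool)
        val_mask[val_idx] = True
        learner.fit(X[~val_mask], y[~val_mask])
        maes.append(mean_absolute_error(y[val_mask], learner.predict(X[val_mask])))
    return np.std(maes)

rng = np.random.RandomState(0)
X = rng.normal(size=(5000, 10))                  # stand-in features
y = np.exp(2 + X[:, 0] + rng.normal(size=5000))  # skewed stand-in target

folds = [v for _, v in KFold(n_splits=5, shuffle=True, random_state=0).split(X)]
for learner in (LinearRegression(), HuberRegressor()):
    print(type(learner).__name__, mae_spread(learner, X, y, folds))
```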
learner | k(n_folds) | c | expected | observed | diff | se | p_val |
---|---|---|---|---|---|---|---|
LinearRegression | 5 | 1.5 | 6.602 | 5.941 | -0.6612 | 0.2646 | 0.00623 |
LinearRegression | 5 | 2 | 6.602 | 5.935 | -0.667 | 0.3085 | 0.01532 |
LinearRegression | 5 | 3 | 6.602 | 5.084 | -1.518 | 0.2595 | 2.434e-09 |
LinearRegression | 5 | 4 | 6.602 | 5.246 | -1.357 | 0.2675 | 1.98e-07 |
LinearRegression | 5 | 5 | 6.602 | 5.465 | -1.138 | 0.2421 | 1.3e-06 |
HuberRegressor | 5 | 1.5 | 9.526 | 7.94 | -1.586 | 0.3521 | 3.335e-06 |
HuberRegressor | 5 | 2 | 9.526 | 7.732 | -1.794 | 0.4157 | 7.96e-06 |
HuberRegressor | 5 | 3 | 9.526 | 7.502 | -2.025 | 0.4148 | 5.294e-07 |
HuberRegressor | 5 | 4 | 9.526 | 8.22 | -1.306 | 0.4968 | 0.004285 |
HuberRegressor | 5 | 5 | 9.526 | 9.121 | -0.4056 | 0.4908 | 0.2043 |
Interestingly, we see that the effect is more pronounced for bigger c. With c in {3,4,5} we have fewer outliers, but they are extreme, and the chance that a fold will be imbalanced is higher. When we tested for homogeneity, the effect was bigger for smaller c.
The conclusion is that when cross-validation is performed for hyperparameter tuning or feature selection, it is beneficial to stratify folds by extreme outliers (c >= 3).
But should we apply this advice when we create folds to make out-of-fold predictions to be used at the ensemble level?
It is an interesting question, but before answering it I'd like to look at the somewhat related topic of the effect of regularization on ensembles, which is discussed in lesson two.
The final note of this part is on a technical question.
How to create stratified folds in R and Python
The Caret package is very popular among data scientists who use R. Caret provides tools to tune model parameters and to cross-validate models, and it includes a number of functions to split data into train and validation folds. But when it comes to k-fold splits (e.g. with the `createFolds` function) Caret always tries to balance the folds it creates (it does this by treating percentiles as groups and stratifying on them). Caret does more than just stratify by outliers.
But what if such a balancing act is unwanted? You should be aware that Caret does not have an option to switch it off, so one will need other ways to create the folds.
For Python users, scikit-learn provides the `StratifiedKFold` class, which is intended for classification tasks, as it stratifies by the unique values of the target. That is not what we need for regression, but it is not difficult to make `StratifiedKFold` stratify by outliers. Compute a boolean vector which is `True` only for the outliers of the target, pass this vector as the `y` argument to `StratifiedKFold.split()`, and you will get stratified fold indexes.
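For example, a minimal sketch (using a random stand-in for the target):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.random.lognormal(mean=7.7, sigma=0.8, size=100000)  # stand-in for the target

# boolean vector: True only for observations above the upper Tukey fence (c = 3)
q1, q3 = np.percentile(y, [25, 75])
is_outlier = y > q3 + 3 * (q3 - q1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(y, is_outlier):
    # each validation fold keeps roughly the same share of outliers
    print(len(val_idx), round(is_outlier[val_idx].mean(), 4))
```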