Friday, February 24, 2017

Data augmentation of convolutional feature maps


Image augmentation is a powerful technique for improving the performance of an image classification system. With image augmentation, you can extract more information from a dataset, which is especially important when your dataset is not big enough. Some would argue that you can never have too much data: even with a dataset as big as ImageNet, the winners of the Large Scale Visual Recognition Challenge augment data during both the training and prediction phases. One could conclude that you should always apply image augmentation.
Yes, you should - when it is feasible. Image augmentation has its cost, and the cost could be high.

Transfer learning is another powerful technique, one that allows you to achieve strong classification performance even with small datasets. Even when you have a larger dataset, using a pre-trained network is considered the default first step, and it will save you a lot of computation time.
It took weeks to train the VGG network, and now we can get the pre-trained model for free, use its weights as a starting point and fine-tune them, or, very commonly, just use the output of the last convolutional layer.
It has been shown that convolutional feature maps of a model pre-trained on ImageNet - VGG, AlexNet, or whatever you choose from the Caffe model zoo - can be used in a range of applications, even for images which are not at all similar to the ImageNet dataset.
I should also point out that even though most pre-trained models were trained on smaller resolutions (like 224x224), you can apply them to high-resolution images if you replace the fully-connected top layers with convolutional layers, as sketched below.
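Here is a minimal sketch of that idea in Keras; the 480x640 input shape is just a hypothetical high-resolution example, and dropping the fully-connected top is what makes it possible:

    from keras.applications.vgg16 import VGG16

    # Without the fully-connected top, VGG16 is fully convolutional and
    # accepts inputs larger than the 224x224 it was trained on.
    model = VGG16(include_top=False, weights='imagenet',
                  input_shape=(480, 640, 3))  # hypothetical high-res input

    # The output is the last pooled feature map - (15, 20, 512) for this
    # input, since VGG16 downsamples by a factor of 32.
    model.summary()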

We have two powerful techniques - image augmentation and transfer learning - could we use them together?
In theory - yes.
But both techniques have an associated cost, which you pay in time and resources, and the combined cost of using both could be prohibitive.
This is especially true when you need to use high-resolution images without down-sampling - for example, when you need to recognize objects which are not dominant in the image, or when you perform fine-grained classification.

For high-resolution images, just applying the pre-trained network takes a lot of time.
If, for every image of your training set, on every epoch, you pass the image through all the layers of the pre-trained network and then through your own layers, training can be very slow.
A common technique to speed things up is to apply the pre-trained network to your images once and save the output of the last convolutional layer. The saved feature maps then become the input to your model,
and you train only the network which is specific to your task.
It is much, much faster - two orders of magnitude faster.
But now you cannot augment the original images anymore: you work with feature maps, not images.
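A minimal sketch of this precomputation step, assuming images is an array of 224x224 inputs already loaded as float32 (the array and file names are just illustrative):

    import numpy as np
    from keras.applications.vgg16 import VGG16, preprocess_input

    # Run the frozen convolutional base once over the whole dataset.
    extractor = VGG16(include_top=False, weights='imagenet')
    features = extractor.predict(preprocess_input(images), batch_size=32)

    # Save the feature maps; every later epoch trains on them directly.
    np.save('vgg16_features.npy', features)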

Could we somehow use the fact that the space of feature maps and the space of the original image are related? The top-left corner of a feature map roughly corresponds to the top-left area of the original image. For example, in the case of VGG, the top-left "pixel" of the last convolutional layer's feature map corresponds to the top-left 16x16 rectangle of the image.

A common transformation applied at the image level is taking a random crop, and we can crop feature maps too. The resolution of a feature map is much lower, but a 10-20% crop at the image level corresponds to a crop of 2-3 feature map "pixels" (for a 224x224 image, the last VGG convolutional feature map is 14x14, so 15% of it is about 2 "pixels").
Random horizontal flip is another common transformation, and we can flip feature maps as well. Technically, such transformations of feature maps are possible and easy to implement.
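Here is a rough NumPy sketch of both transformations applied directly to a single feature map (the function name and crop size are my own illustration, not code from the notebook):

    import numpy as np

    def augment_feature_map(fmap, max_crop=2):
        """Random crop and random horizontal flip of a single
        (height, width, channels) convolutional feature map."""
        h, w, _ = fmap.shape
        top = np.random.randint(0, max_crop + 1)
        left = np.random.randint(0, max_crop + 1)
        # Fixed output size, random offset - the feature map analogue
        # of taking a random crop of the image.
        cropped = fmap[top:top + h - max_crop, left:left + w - max_crop, :]
        if np.random.rand() < 0.5:
            cropped = cropped[:, ::-1, :]  # flip along the width axis
        return cropped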

But the real question is: are they useful? Could we replace image-level transformations with feature map transformations and achieve a similar effect?

There is a very interesting paper discussing spatial transformations of feature maps in detail.

However, the goal of this post is not theory. I just want to apply image augmentation and feature map transformations to a dataset and compare performance and running time.

I will train three models on very small subsets of Kaggle's Dogs vs. Cats dataset.
And I mean very small - we will start with a training set of just 10 samples (5 dogs and 5 cats), then 20, 40, and 80 samples. I will use pre-trained VGG to produce convolutional feature maps and train a small fully convolutional network on top of them.
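For illustration, a head of roughly that shape might look like this in Keras (a hypothetical sketch; the actual architecture is in the notebook linked below):

    from keras.models import Sequential
    from keras.layers import Conv2D, GlobalAveragePooling2D, Activation

    # A small fully convolutional head on top of saved VGG feature maps.
    # Spatial dims are left as None so cropped feature maps still fit.
    model = Sequential([
        Conv2D(128, (3, 3), activation='relu',
               input_shape=(None, None, 512)),
        Conv2D(1, (1, 1)),         # per-location dog-vs-cat logit
        GlobalAveragePooling2D(),  # average the logits over positions
        Activation('sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])

Because the head is fully convolutional, the same model works on full 14x14 feature maps and on their random crops.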

You can look at the Augmenting Feature Maps notebook for the details, the code, and the architecture of the model.

Here we go straight to the results:

n_samples  model                      accuracy  logloss  sec/epoch
10         baseline                   0.735     0.476    0
10         feature maps augmentation  0.853     0.363    0
10         image augmentation         0.806     0.395    11
20         baseline                   0.908     0.296    0
20         feature maps augmentation  0.915     0.264    0
20         image augmentation         0.932     0.257    11
40         baseline                   0.929     0.236    0
40         feature maps augmentation  0.933     0.221    0
40         image augmentation         0.932     0.216    12
80         baseline                   0.941     0.199    0
80         feature maps augmentation  0.941     0.206    0
80         image augmentation         0.934     0.200    14


What is amazing is the power of pre-trained models: with VGG features and data augmentation, a very simple model achieves reasonable performance trained on just 10 samples, and very good performance on just 40 samples.
We also see that data augmentation helps a lot with 10 samples, still helps with the 20-sample subset, and then we observe diminishing returns - with 40 samples the baseline model is catching up, and with 80 the performance is practically the same.
We also see that training with image augmentation is much slower, while there is no noticeable slowdown with feature map augmentation.

Still, it is important to point out that there are more augmentations which you can apply at the image level. You can augment color and lighting conditions, and you can do more elaborate spatial transformations. Such augmentations cannot be done at the feature map level.

The obvious benefit of feature map transformations is that you can boost performance at very little cost. Also, such transformations could be implemented as a custom layer of a network.
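A rough sketch of that idea as a custom Keras layer (my own illustration, not code from the notebook): it flips the feature maps at training time and is the identity at test time.

    from keras import backend as K
    from keras.layers import Layer

    class RandomFeatureMapFlip(Layer):
        """Randomly flips feature maps along the width axis during
        training; does nothing at inference time."""
        def call(self, x, training=None):
            def flipped():
                # One coin toss per step for the whole batch - a
                # simplification that keeps the sketch short.
                flip = K.cast(K.random_uniform(()) < 0.5, K.floatx())
                return flip * K.reverse(x, axes=2) + (1 - flip) * x
            return K.in_train_phase(flipped, x, training=training)

Keeping the augmentation inside the model graph means it costs essentially nothing per epoch.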

It would be interesting to try feature map transformations on a more challenging dataset, which I will do soon and share the results here.

P. S.
For those who would like to learn more about data augmentation and transfer learning, I recommend the following papers:

Spatial Transformer Networks
How transferable are features in deep neural networks?
Some improvements on deep convolutional neural network based image classification