Notes for CNN course by deeplearning.ai (week 2)

These are my notes for week 2, focusing on case studies.

Video: Why look at case studies?

Looking at classic CNNs is a good way to gain intuition about how CNNs work. Also, in computer vision, an architecture that is successful for one task often works well for another task.

Classic CNN examples looked at this week: LeNet-5, AlexNet, VGG-16, ResNet, and the Inception network.

Video: Classic Networks

LeNet-5 (1998)

Introduced to recognize black & white hand-written digits (32x32x1). The architecture is described below. Note that at the time of the paper, padding was not common, so every convolutional layer shrinks the height and width of the image.

All non-linearities were tanh or sigmoid, and activation functions were also applied after the pooling layers. The whole network had about 60,000 parameters, which is small by today’s standards (modern networks often have millions of parameters). However, the variation in dimensions through the network remains typical: height and width go down, the number of channels goes up. Another aspect that often still holds today is the alternation of layers: conv, pool, conv, pool, fc, fc, output. A last note: they had a different way of applying the kernels to the input image (i.e., the kernels did not have the same number of channels as their input).
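To make the layer pattern concrete, here is a rough re-creation of LeNet-5 in tf.keras (my assumption for the library). It is not the exact 1998 model: the original used trainable pooling coefficients and a different output layer; plain average pooling and a softmax are used here.

```python
# Rough sketch of the LeNet-5 layer pattern (not the exact 1998 model).
import tensorflow as tf
from tensorflow.keras import layers, models

lenet5 = models.Sequential([
    layers.Input(shape=(32, 32, 1)),                     # 32x32 grayscale digit
    layers.Conv2D(6, 5, strides=1, activation="tanh"),   # -> 28x28x6 (no padding)
    layers.AveragePooling2D(pool_size=2, strides=2),     # -> 14x14x6
    layers.Conv2D(16, 5, strides=1, activation="tanh"),  # -> 10x10x16
    layers.AveragePooling2D(pool_size=2, strides=2),     # -> 5x5x16
    layers.Flatten(),                                    # 400 features
    layers.Dense(120, activation="tanh"),
    layers.Dense(84, activation="tanh"),
    layers.Dense(10, activation="softmax"),              # 10 digit classes
])
lenet5.summary()   # roughly 60k parameters, matching the figure quoted above
```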

AlexNet (2012)

The architecture is described below. The input image is 227x227x3.

Many similarities with LeNet-5, but AlexNet has about 60M parameters. Also, they used ReLU activation functions. They also used another type of layer called Local Response Normalization, which normalizes each pixel/position across all channels. This type of layer is not really used anymore, as it was found to have very little impact on the results.
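Since the architecture figure is not reproduced here, below is a hedged sketch of the AlexNet layer sequence as presented in the lecture (single-stream version; the original paper split the network across two GPUs and used Local Response Normalization, which is omitted here).

```python
# Sketch of the AlexNet layer sequence (single-stream, no LRN).
import tensorflow as tf
from tensorflow.keras import layers, models

alexnet = models.Sequential([
    layers.Input(shape=(227, 227, 3)),
    layers.Conv2D(96, 11, strides=4, activation="relu"),       # -> 55x55x96
    layers.MaxPooling2D(pool_size=3, strides=2),                # -> 27x27x96
    layers.Conv2D(256, 5, padding="same", activation="relu"),   # -> 27x27x256
    layers.MaxPooling2D(pool_size=3, strides=2),                # -> 13x13x256
    layers.Conv2D(384, 3, padding="same", activation="relu"),   # -> 13x13x384
    layers.Conv2D(384, 3, padding="same", activation="relu"),   # -> 13x13x384
    layers.Conv2D(256, 3, padding="same", activation="relu"),   # -> 13x13x256
    layers.MaxPooling2D(pool_size=3, strides=2),                # -> 6x6x256
    layers.Flatten(),                                           # 9216 features
    layers.Dense(4096, activation="relu"),
    layers.Dense(4096, activation="relu"),
    layers.Dense(1000, activation="softmax"),                   # ImageNet classes
])   # roughly 60M parameters, most of them in the fully-connected layers
```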

VGG-16 (2015)

The architecture can be visualized here. VGG-16 really simplified the neural networks architecture. It only relies on a single type of convolutional layer (3x3, stride=1, same padding) and a single type of max pooling (2x2, stride=2). The input is 224x224x3.

VGG-16 has about 138M parameters, which is large even by today’s standards. The 16 refers to the number of weight layers (13 convolutional + 3 fully-connected). There is also a VGG-19, but its performance is comparable. VGG is attractive to the community because it simplifies the construction of the network by systematizing the change in height/width and channels from one layer to the next.
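The systematic pattern makes VGG-16 easy to write as a loop. A sketch in tf.keras (assumed library), using only 3x3 "same" convolutions with stride 1 and 2x2 max pooling with stride 2:

```python
# Sketch of the VGG-16 pattern: channels double block after block,
# height/width halve after each pooling layer.
import tensorflow as tf
from tensorflow.keras import layers, models

def vgg16_sketch(num_classes=1000):
    model = models.Sequential([layers.Input(shape=(224, 224, 3))])
    # (number of conv layers, number of filters) for each block
    for n_convs, filters in [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]:
        for _ in range(n_convs):
            model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(pool_size=2, strides=2))   # halves height/width
    model.add(layers.Flatten())                                   # 7x7x512 -> 25088
    model.add(layers.Dense(4096, activation="relu"))
    model.add(layers.Dense(4096, activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model                                                  # ~138M parameters
```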

Video: ResNets (2016)

ResNets were introduced to help train very deep neural networks. In practice, we notice that the training error of a plain network starts to increase after a certain depth is reached. ResNets allow training networks that are 100 or even 1,000 layers deep while still seeing the training error gradually decrease as the depth of the network increases. ResNets help with the problems of vanishing and exploding gradients.

The elementary block of a ResNet is the residual block, which adds a skip connection (or shortcut) to typical fully-connected layers. In a given layer, after the linear part but before applying the ReLU activation function, the output of a layer a few levels earlier is added, e.g., $a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$.
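A minimal sketch of that formula with two fully-connected layers, written with the tf.keras functional API (my assumption; the helper name is made up):

```python
# Residual block over two fully-connected layers:
# the shortcut a[l] is added to z[l+2] before the final ReLU.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block_fc(a_l, units):
    """a_l: activations of layer l, with `units` features so the shapes match."""
    z1 = layers.Dense(units)(a_l)
    a1 = layers.Activation("relu")(z1)                           # a[l+1] = g(z[l+1])
    z2 = layers.Dense(units)(a1)                                  # z[l+2]
    return layers.Activation("relu")(layers.Add()([z2, a_l]))    # a[l+2] = g(z[l+2] + a[l])
```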

Video: Why ResNets work?

Uses an intuitive example to try and show why ResNets work. The intuition is that a residual block can very easily learn the identity map: with $W=0$ and $b=0$, $a^{[l+2]} = g(a^{[l]}) = a^{[l]}$ (for ReLU activations). That means it’s easy for a residual block plugged on top of an existing deep network to do at least as well as the original network. After training, it’s easy to imagine that the added block will improve performance.

Something else to notice about ResNets is that $z^{[l+2]}$ and $a^{[l]}$ need to have the same dimensions. For that reason, you often see people use “same” convolutions (same padding). However, this is not mandatory: one can apply a linear transformation $W_s$ to $a^{[l]}$ so that its dimensions match those of $z^{[l+2]}$, i.e., $a^{[l+2]} = g(z^{[l+2]} + W_s a^{[l]})$. That linear transformation can be learned or fixed (e.g., zero-padding).
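A sketch of the convolutional case where the main path changes the number of channels (and possibly the spatial size), so the shortcut needs the linear transformation $W_s$; here a learned 1x1 convolution plays that role (tf.keras assumed, function name hypothetical):

```python
# Convolutional residual block with an optional projection shortcut (W_s).
import tensorflow as tf
from tensorflow.keras import layers

def residual_block_conv(a_l, filters, stride=1):
    z = layers.Conv2D(filters, 3, strides=stride, padding="same")(a_l)
    a = layers.Activation("relu")(z)
    z = layers.Conv2D(filters, 3, padding="same")(a)                 # z[l+2]
    if stride != 1 or a_l.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride)(a_l)    # W_s a[l]
    else:
        shortcut = a_l                                               # identity shortcut
    return layers.Activation("relu")(layers.Add()([z, shortcut]))
```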

Video: Networks in Networks and 1x1 Convolutions

In the paper Network in Network, they discuss the use of a 1x1 convolution. It’s interesting in the case of a multi-channel image, where it takes a linear combination of the channels at each position (before passing the result through a non-linear activation function).

This idea of 1x1 convolution can be used to shrink the number of channels of an image without altering the height or width.
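A quick illustration in tf.keras (assumed library): a 1x1 convolution that shrinks 28x28x192 down to 28x28x32 while leaving height and width untouched.

```python
# 1x1 convolution used to reduce the number of channels.
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 28, 28, 192))                  # dummy 192-channel input
y = layers.Conv2D(32, kernel_size=1, activation="relu")(x)
print(y.shape)                                          # (1, 28, 28, 32)
```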

Video: Inception Network Motivation

The main motivation is that instead of choosing among a 1x1 convolution, a 3x3 convolution, a 5x5 convolution, or a max-pooling layer, why not just do them all in a single layer and stack all the outputs along the channel dimension? That’s what the inception layer is about. You of course need same padding to preserve the height and width and be able to stack the outputs along the channel dimension.

The problem with that approach is the computational cost. For instance, a 5x5 convolution with same padding going from 28x28x192 to 28x28x32 involves about 120M multiplications. A solution is to introduce a bottleneck layer, i.e., a 1x1 convolution that reduces the number of channels (e.g., from 192 down to 16), and then apply the 5x5 convolution on that output with the reduced number of channels. This brings the computational cost down to roughly one tenth (about 12.4M multiplications).
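The arithmetic behind those two numbers, as a back-of-the-envelope check:

```python
# Multiplication counts for the 28x28x192 -> 28x28x32 example.
out_h, out_w = 28, 28

direct = out_h * out_w * 32 * (5 * 5 * 192)             # 5x5 conv straight on 192 channels
print(f"direct 5x5 conv:  {direct / 1e6:.1f}M mults")    # ~120.4M

bottleneck = out_h * out_w * 16 * (1 * 1 * 192)          # 1x1 conv down to 16 channels
conv5x5    = out_h * out_w * 32 * (5 * 5 * 16)           # then 5x5 conv on 16 channels
print(f"with bottleneck:  {(bottleneck + conv5x5) / 1e6:.1f}M mults")   # ~12.4M
```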

Video: Inception network

A visualization of the architecture of the inception network can be found here.

The rough idea is as described in the previous section: apply 1x1, 3x3, 5x5 convolutions and max-pooling in parallel, then concatenate all the channels together. To reduce the computational cost, the 3x3 and 5x5 convolutional layers are preceded by a bottleneck layer, and to avoid having max-pooling dominate the output, it is followed by a bottleneck (1x1) layer. All of this makes up an inception module. The inception network is more or less a stack of multiple inception modules.
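A hedged sketch of one inception module in tf.keras (assumed library); the filter counts are illustrative defaults, not necessarily the exact ones from the paper.

```python
# One inception module: parallel 1x1, 3x3, 5x5 convolutions and max pooling,
# concatenated along the channel dimension.
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1=64, f3_in=96, f3=128, f5_in=16, f5=32, fpool=32):
    branch1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    # bottleneck 1x1 convs before the expensive 3x3 and 5x5 convolutions
    branch3 = layers.Conv2D(f3_in, 1, padding="same", activation="relu")(x)
    branch3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(branch3)
    branch5 = layers.Conv2D(f5_in, 1, padding="same", activation="relu")(x)
    branch5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(branch5)
    # max pooling keeps height/width (stride 1, same padding); the 1x1 conv
    # after it keeps pooling from dominating the number of output channels
    branchp = layers.MaxPooling2D(pool_size=3, strides=1, padding="same")(x)
    branchp = layers.Conv2D(fpool, 1, padding="same", activation="relu")(branchp)
    return layers.Concatenate(axis=-1)([branch1, branch3, branch5, branchp])
```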

A noticeable addition is that there are intermediate probes in the network that try to predict the output with a sequence of fully-connected layers followed by a softmax. The rationale is to regularize the network by forcing it to make these intermediate predictions.

Important note: Inception was the architecture used to introduce batch norm. A few other modifications were made to the network though (learning rate, no dropout,…).

Video: Using open-source implementation

Main takeaway: look for open-source implementations of published architectures on GitHub and clone them (git clone) rather than re-implementing from scratch; pretrained weights are often available as well.

Video: Transfer Learning

At least in computer vision, transfer learning should always be considered unless you have an exceptionally large dataset. Transfer learning ranges from: (i) keeping all layers frozen and replacing/training only the output layer (e.g., the softmax layer); (ii) keeping only the earlier layers frozen and training the rest (if you have more data); (iii) re-training the entire network, using the trained model as the initialization.
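A sketch of option (i) in tf.keras (assumed library and weight source; the number of classes is hypothetical):

```python
# Transfer learning, option (i): freeze a pretrained network and train only
# a new softmax head on top of it.
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(224, 224, 3), pooling="avg")
base.trainable = False                       # freeze all pretrained layers

num_classes = 5                              # hypothetical new task
model = models.Sequential([
    base,
    layers.Dense(num_classes, activation="softmax"),   # only these weights train
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# Options (ii)/(iii): unfreeze some or all of `base` (base.trainable = True,
# typically with a smaller learning rate) when more data is available.
```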

Video: Data augmentation

In most computer vision applications today, you never have enough data. Data augmentation can help “get” more data.

A few common methods: mirroring (horizontal flips) and random cropping; rotation, shearing, and local warping are also used, but less often.

Another type of data augmentation is color shifting, where you add random perturbations to the RGB channels. There is also “PCA color augmentation” (see the AlexNet paper).

In terms of computational efficiency, data augmentation can be done on separate CPU threads before passing the augmented data to the CPU/GPU used for training, so the two steps run in parallel.

Note that data augmentation most often comes with a set of hyperparameters that also need to be selected.
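A sketch of these augmentations applied on the fly with tf.image (assumed helpers); the random ranges are exactly the kind of hyperparameters mentioned above, and the values here are placeholders.

```python
# On-the-fly data augmentation: mirroring, random cropping, color shifting.
import tensorflow as tf

def augment(image):
    # assumes images are stored slightly larger than the 224x224 training size
    image = tf.image.random_flip_left_right(image)             # mirroring
    image = tf.image.random_crop(image, size=(224, 224, 3))    # random cropping
    image = tf.image.random_brightness(image, max_delta=0.2)   # simple color shifting
    image = tf.image.random_saturation(image, 0.8, 1.2)
    return image

# Typically mapped over a tf.data pipeline so augmentation runs on CPU threads
# in parallel with training on the GPU:
# dataset = dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE).prefetch(1)
```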

Video: State of Computer Vision

Less data means you’ll need more hand-engineering to get good results.

Other tips (mostly for doing well on benchmarks/competitions): ensembling (train several networks independently and average their outputs), and multi-crop at test time (e.g., 10-crop: run the classifier on several crops of the test image and average the results). Also, reuse architectures, open-source implementations, and pretrained models from the literature whenever possible.

[ cnn  course  deeplearning  ]