Notes for CNN course by deeplearning.ai (week 4)
08 Feb 2021

This week focuses on Face Recognition and Neural Style Transfer.
Video: What is face recognition?
Verification vs recognition: verification means you are given a pair (image, name) and must return whether the image actually shows that person.
Recognition means you have K persons in your database and given an input image you try to detect whether that image belongs to one of the K persons in your database.
You actually need a highly accurate verification system to build a recognition system, because recognition amounts to running verification K times. So you have K times more chances to make an error. This is why recognition is typically considered a much harder problem.
Also, face recognition is a one-shot learning problem: you typically have a single example of each face, and you want to train a face recognition algorithm from that.
Video: One Shot Learning
The challenge with face recognition is that you have very few samples (often one per individual). That is a one-shot learning problem.
You cannot solve it as a typical object/face classification problem. Instead, what you can do is train a network to compute a similarity function between a given input image and each of the individuals you have in your database (plus a category “unknown”). You then choose a threshold below which you declare a match between the input and a person in your db.
Video: Siamese Network
You can define 2 identical networks (same parameters -> siamese) that will each take a face image as input, pass it through a set of conv layers, then fc layers, then output an encoding vector (say, a 128-d vector). Then you can define the distance between both faces as $\| f(x_1) - f(x_2)\|^2$: the smaller the distance, the more similar the faces.
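As a minimal numpy sketch of this distance (the 128-d vectors below are toy stand-ins for the siamese network outputs $f(x_1)$, $f(x_2)$, not real face encodings):

```python
import numpy as np

def similarity(f1, f2):
    # Squared L2 distance between the two 128-d encodings:
    # a small distance suggests the two images show the same person.
    return np.sum((f1 - f2) ** 2)

# Toy encodings standing in for the siamese network outputs
f1 = np.zeros(128)
f2 = np.ones(128)
print(similarity(f1, f1))  # 0.0 -> identical encodings
print(similarity(f1, f2))  # 128.0 -> each of the 128 entries differs by 1
```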
Video: Triplet Loss
The loss is defined from a triplet of images: A=anchor, P=positive (same person as A), N=negative (different person). Using the notion of margin (as with support vector machines), you define a positive float $\alpha$, then you want $\| f(A) - f(P)\|^2 - \| f(A) - f(N)\|^2 + \alpha \leq 0$. Since we don’t need this quantity to be extremely negative, just non-positive, and to avoid having the network push on a corner case, we take $\max(\| f(A) - f(P)\|^2 - \| f(A) - f(N)\|^2 + \alpha,\ 0)$ as the loss.
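The triplet loss above can be sketched in a few lines of numpy (the 2-d toy embeddings and the $\alpha = 0.2$ default are illustrative choices, not values from the course):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    # max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)
    d_pos = np.sum((f_a - f_p) ** 2)  # anchor-positive squared distance
    d_neg = np.sum((f_a - f_n) ** 2)  # anchor-negative squared distance
    return max(d_pos - d_neg + alpha, 0.0)

# Toy embeddings: positive close to the anchor, negative far away.
a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])
n = np.array([1.0, 1.0])
print(triplet_loss(a, p, n))  # d_pos=0.01, d_neg=2.0 -> max(-1.79, 0) = 0.0
```

An easy triplet like this one yields zero loss and hence zero gradient, which is exactly why the next paragraph's point about selecting hard triplets matters.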
To train that system, you need multiple faces per person (so training is not really one-shot…). Then, when selecting a triplet, you want the two squared distances to be close to each other, so that the triplet is hard and forces the network to learn something. That is to say, you don’t pick A, P, N at random.
Video: Face Verification and Binary Classification
Face verification can also be posed as a binary classification problem, by taking the same pair of siamese networks, then passing both 128-d output vectors to a logistic regression unit (for instance, feeding it the element-wise absolute difference between the two 128-d vectors). There are also variants.
Notice that you can pre-compute the encodings of all the people in your database, so that at inference time you only have to compute the encoding of the face you are trying to recognize. This speeds up computation.
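The two ideas above can be combined into a small sketch: a logistic unit on the element-wise absolute difference, scored against precomputed database encodings. Everything here (the names, the weights `w`, `b`, the toy encodings) is hypothetical; in practice `w` and `b` would be learned:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def verify(f_query, f_db, w, b):
    # Logistic regression unit on the element-wise absolute
    # difference between the two 128-d encodings.
    return sigmoid(np.dot(w, np.abs(f_query - f_db)) + b)

# Hypothetical precomputed database encodings (name -> 128-d vector)
database = {"alice": np.zeros(128), "bob": np.ones(128)}
# Toy parameters: large encoding differences push the score toward 0
w = -np.ones(128)
b = 5.0

f_query = np.zeros(128)  # encoding of the face presented at inference time
scores = {name: verify(f_query, f_db, w, b) for name, f_db in database.items()}
# "alice" scores sigmoid(5) ~ 0.993 (match); "bob" scores sigmoid(-123) ~ 0
```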
Video: What is neural style transfer?
To implement style transfer, you need to understand (and use) the features extracted by the different layers of a convnet. But for that, you need to understand what is being extracted.
Video: What are deep ConvNets learning?
Typically, shallow layers detect simple, low-level features (edges, colors, textures), while deeper layers detect increasingly complex, larger-scale features (object parts, whole objects).
Video: Cost Function
The cost function is the sum of a content cost function and a style cost function: the first measures the similarity between the generated image and the content image, the second between the generated image and the style image.
The image is generated by applying gradient descent directly to the generated image, starting from a random image.
Video: Content Cost Function
The content cost is defined as the squared norm of the difference between the activations of the content (input) image and of the generated image, at a given layer of a pre-trained ConvNet. The choice of layer determines how much granularity you want to preserve in the generated image.
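A minimal numpy sketch of this content cost (a tiny toy activation volume stands in for a real layer's activations; normalization factors such as $\frac{1}{4 n_H n_W n_C}$ are sometimes added but omitted here):

```python
import numpy as np

def content_cost(a_C, a_G):
    # Squared norm of the difference between the chosen layer's
    # activations for the content image (a_C) and generated image (a_G).
    return np.sum((a_C - a_G) ** 2)

a_C = np.arange(8.0).reshape(2, 2, 2)  # toy (nH, nW, nC) activation volume
a_G = a_C + 1.0                        # every entry differs by exactly 1
print(content_cost(a_C, a_G))  # 8 entries, each contributing 1^2 -> 8.0
```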
Video: Style Cost
As for the content cost function, you pick a layer from your ConvNet, then work with the activations of that layer.
The style of an image is defined via its style matrix (Gram matrix), an $n_c \times n_c$ matrix whose entry $(k, k')$ sums, over all spatial positions, the product of the activations of channels $k$ and $k'$: \(G_{k,k'} = \sum_i \sum_j a_{ijk} a_{ijk'}\). In the video, they call that the correlation between the two channels; strictly, that would only be true if the channel activations had zero mean.
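The style matrix is straightforward to compute by flattening the spatial dimensions, as in this numpy sketch (the constant toy activation volume is just for illustration):

```python
import numpy as np

def style_matrix(a):
    # a has shape (nH, nW, nC).
    # G[k, k'] = sum over spatial positions (i, j) of a[i,j,k] * a[i,j,k'].
    nH, nW, nC = a.shape
    A = a.reshape(nH * nW, nC)  # one row per spatial position
    return A.T @ A              # (nC, nC) Gram matrix

a = np.ones((3, 3, 2))  # toy activation volume: 3x3 spatial, 2 channels
G = style_matrix(a)
print(G)  # each entry sums 1*1 over the 9 spatial positions -> all 9.0
```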
You calculate the style matrix for both the style image and the generated image. The style cost function for that layer is the squared Frobenius norm of the difference between both style matrices, up to a normalization constant (which matters little in practice, since the style cost is multiplied by a tunable weight anyway).
Actually, the overall style cost function is a weighted sum of the style costs of the individual layers.
The overall cost function is the weighted sum of the content cost function and the style cost function. Then you minimize your cost function (gradient descent, or other optimizer) by changing the generated image.
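Putting the pieces together, here is a numpy sketch of a per-layer style cost with the commonly used $\frac{1}{(2 n_H n_W n_C)^2}$ normalization, and of the overall weighted cost (the $\alpha = 10$, $\beta = 40$ defaults are illustrative assumptions, not prescribed values):

```python
import numpy as np

def style_cost_layer(G_S, G_G, nH, nW, nC):
    # Squared Frobenius norm of the Gram-matrix difference,
    # scaled by the usual 1/(2*nH*nW*nC)^2 normalization constant.
    return np.sum((G_S - G_G) ** 2) / (2 * nH * nW * nC) ** 2

def total_cost(J_content, J_style, alpha=10.0, beta=40.0):
    # Weighted sum of content and style costs; alpha and beta
    # are tunable hyperparameters.
    return alpha * J_content + beta * J_style

# Toy Gram matrices for a layer with nH=3, nW=3, nC=2:
G_S = np.full((2, 2), 9.0)
G_G = np.zeros((2, 2))
print(style_cost_layer(G_S, G_G, 3, 3, 2))  # 4*81 / 36^2 = 0.25
print(total_cost(1.0, 2.0))                 # 10*1 + 40*2 = 90.0
```

Gradient descent then updates the pixels of the generated image to reduce `total_cost`; a framework with autodiff (e.g. TensorFlow or PyTorch) would handle the gradients in practice.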
Video: 1D and 3D Generalizations
Shows how convolutional layers can be applied to 1D and 3D data, not just 2D images.
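As a minimal sketch of the 1D case, here is a "valid" 1-D convolution (really a cross-correlation, as is conventional in ConvNets) written in plain numpy; the signal and filter are toy values:

```python
import numpy as np

def conv1d_valid(x, w):
    # Slide filter w over signal x with stride 1 and no padding;
    # output length is len(x) - len(w) + 1.
    n, f = len(x), len(w)
    return np.array([np.dot(x[i:i + f], w) for i in range(n - f + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 0.0, -1.0])  # simple difference (edge-like) filter
print(conv1d_valid(x, w))  # [1-3, 2-4, 3-5] = [-2. -2. -2.]
```

The 3D case is the same idea with a filter sliding over three spatial dimensions (e.g. over CT-scan volumes or video frames).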
[cnn, course, deeplearning]