Notes for CNN course by deeplearning.ai (week 1)

I decided to take the course Convolutional Neural Networks by deeplearning.ai, hosted on Coursera, and I’m going to take some notes in this document.

Week 1: Foundations of Convolutional Neural Networks

Objectives:

  • Explain the convolution operation
  • Apply two different types of pooling operations
  • Identify the components used in a convolutional neural network (padding, stride, filter, …) and their purpose
  • Build and train a ConvNet in TensorFlow for a classification problem

Video: Computer Vision

Example of vision problems:

  • Image classification: decide what a picture represents (given a set number of categories)
  • object detection: draw boxes around specific objects in a picture
  • neural style transfer: content image + style image -> content image with the style of the style image

Challenges: the input can be really big. A 64x64x3 image already has 12,288 input values, and that is a small image. Larger images blow up quickly: 1000x1000x3 = 3M values. With a first fully-connected layer of 1,000 nodes, you end up with 3 billion weights just for that layer. With so many parameters it is hard not to overfit, plus you need lots of memory.

To still be able to use large images, you need to use CNNs.

Video: Edge Detection Example

CNNs are based on the convolution operation, illustrated here with an edge detection example. Another interesting reference for convolution is Colah’s blog.

How do you detect edges in an image? The video shows an example of a 6x6 matrix convolved with a 3x3 kernel (or filter), giving a 4x4 image. You just slide the kernel over the image and, at each step, multiply the entries pointwise then sum them all. Similar to what you would do with the actual operator, \((f \star g)(t) = \int_{-\infty}^{\infty} f(\tau)g(t-\tau) d\tau\), where for each $t$ the function $g$ (the kernel, or filter) is shifted by an amount $t$.

Actually, this mathematical definition is a little bit different as the kernel is mirrored around its axes (we have $g(t-\tau)$ as a function of $\tau$). This is discussed in the section on Video: Strided Convolutions.

Don’t need to re-implement a convolution operator: python -> conv_forward, tensorflow -> tf.nn.conv2d, keras -> Conv2D

Why is this doing edge detection? Look at a picture with 2 colors separated by a vertical edge. Apply the 3x3 kernel whose rows are each (1, 0, -1). You obtain a thick edge in the middle of the resulting 4x4 image.
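
A minimal numpy sketch of that example (my own toy implementation of a valid convolution, not the course’s code):

import numpy as np

def conv2d_valid(image, kernel):
    # "convolution" as used in deep learning: no flipping of the kernel (really a cross-correlation)
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(n - f + 1):
        for j in range(n - f + 1):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

# 6x6 image: bright (10) on the left half, dark (0) on the right half
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
kernel = np.array([[1, 0, -1]] * 3, dtype=float)  # vertical edge detector
print(conv2d_valid(image, kernel))
# -> 4x4 output whose two middle columns are 30: the (thick) detected vertical edge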

Video: More Edge Detection

Introduces more examples of filters, for instance for horizontal edge detection, but also different types of vertical edge detectors: eg, 1,0,-1 // 2,0,-2 // 1,0,-1 (Sobel filter) or 3,0,-3 // 10,0,-10 // 3,0,-3 (Scharr filter). But you can also parametrize the values of your kernel and learn the ideal kernel for your problem. That is a key idea of modern computer vision.

Video: Padding

The discrete convolution presented above shrinks the image, so you could only apply it a few times. With an initial image of size $n \times n$ and a kernel of size $f \times f$, the convolved image is $(n-f+1) \times (n-f+1)$.

Also, pixels close to the edge of an image are used much less in the convolution compared to central pixels.

Solution for both problems: pad the image before applying the convolution operator, i.e., add pixels on the outside of the image: eg, 6x6 ->(padding) 8x8 ->(convolution w/ 3x3 kernel) 6x6. The padded, convolved image has dim $(n+2p-f+1) \times (n+2p-f+1)$. By convention, you pad with 0’s.

How much to pad? Valid convolution vs Same convolution

  • Valid convolution = no padding
  • Same convolution: pad so as to offset the shrinking effect of the convolution (final image has same dim as input image) -> $2p = f-1$

Note that by convention, in computer vision, f is typically an odd number: 1 (less common), 3, 5, 7. That has the advantage that you can have a symmetric padding.

Video: Strided Convolutions

Stride = by how many pixels you shift the kernel (vertically and horizontally). So the convolved image will be smaller with stride 2 than with stride 1. Resulting image side length: $\lfloor (n+2p-f)/s \rfloor + 1$ (the floor handles the case where the kernel does not fit an exact number of times).
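
A tiny (hypothetical) helper to sanity-check that formula:

import math

def conv_output_size(n, f, p=0, s=1):
    # side length of an n x n image convolved with an f x f kernel, padding p, stride s
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(6, 3))        # valid convolution: 4
print(conv_output_size(6, 3, p=1))   # same convolution (p = (f-1)/2): 6
print(conv_output_size(7, 3, s=2))   # stride 2: 3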

Cross-correlation vs convolution:
As discussed earlier, the actual convolution operator would require flipping the kernel around its axes, since we compute \(\int_{-\infty}^{\infty} f(\tau) g(t-\tau) d\tau\). But in deep learning, people don’t flip it. So in that sense, this is closer to the cross-correlation operator, which is defined, for real functions, as \(\int_{-\infty}^{\infty} f(\tau) g(t+\tau) d\tau\). But as explained in the video, the convention is such that people in deep learning still call that a convolution operator.

Video: Convolutions over volumes

Application: RGB images (3 color channels).
Convolve RGB image with a 3d kernel (same number of channels).

Notation: Image 6 x 6 x 3 = (height, width, channels). Number of channels in image and kernel must be equal.

Really this is 2d convolutions repeated over a stack of 2d images. Not really a 3d convolution. You could imagine having an image of dim 6 x 6 x 6, and convolve along the third dim also.

Multiple filters: you can apply different kernels to the same image(s) and stack the results to generate a 3d output. For instance, you could combine the output of a vertical edge detector and a horizontal edge detector.

Dim summary when using nb_filters_used filters on an image with c channels: image = $n \times n \times c$, kernel = $f \times f \times c$, then output = $(n-f+1) \times (n-f+1) \times$ nb_filters_used.
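
A quick shape check with tf.nn.conv2d (random values, just to look at the dimensions; TensorFlow is assumed here since it is used later in these notes):

import numpy as np
import tensorflow as tf

# one 6x6 RGB image (batch of 1) and 2 filters of size 3x3x3
image = tf.constant(np.random.rand(1, 6, 6, 3), dtype=tf.float32)
filters = tf.constant(np.random.rand(3, 3, 3, 2), dtype=tf.float32)
out = tf.nn.conv2d(image, filters, strides=[1, 1, 1, 1], padding='VALID')
print(out.shape)  # (1, 4, 4, 2): 6-3+1 = 4 in each spatial dim, one output channel per filter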

Video: One layer of a convolutional network

Convolutional layer = activation_function(convolutions with multiple filters (each with the right number of channels) + biases), where each bias is a single real number. So the output has the same dimension as in the example from the previous video. Don’t forget that there is a single bias per filter.

And the number of parameters is equal to (dim of kernel + 1) x nb_kernels. And this number of parameters is independent of the dimension of the input!
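
For example, 10 filters of size 3 x 3 x 3 give $(3 \times 3 \times 3 + 1) \times 10 = 280$ parameters, whether the input image is 64 x 64 x 3 or 1000 x 1000 x 3.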

Notations for layer $l$:

  • $f^{[l]}$ = filter size
  • $p^{[l]}$ = padding
  • $s^{[l]}$ = stride
  • input: $n_H^{[l-1]} \times n_W^{[l-1]} \times n_C^{[l-1]}$
  • output: $n_H^{[l]} \times n_W^{[l]} \times n_C^{[l]}$, where \(n_{H/W}^{[l]} = \lfloor (n_{H/W}^{[l-1]} + 2p^{[l]} - f^{[l]}) / s^{[l]} \rfloor + 1\).
  • And $n_C^{[l]}$ is equal to the number of filters used in the convolutional layer.
  • Each filter has dim $f^{[l]} \times f^{[l]} \times n_C^{[l-1]}$, as the number of channels in the kernel has to be equal to the number of channels in the input image.
  • Weights: $f^{[l]} \times f^{[l]} \times n_C^{[l-1]} \times n_C^{[l]}$
  • Bias: $n_C^{[l]}$

Note: the ordering Batch, Height, Width, Channels (NHWC) is not universal, as some people put the channels dimension right after the batch dimension (NCHW).

Video: A simple convolutional network example

The basic idea is to stack convolutional layers one after the other. At the end, you can flatten the last output volume and pass it to an appropriate loss function.

One general trend is that the image dimension goes down as we progress through the net, while the number of channels goes up.

A basic convolutional network includes convolutional layers, pooling layers, and fully-connected layers.

Video: Pooling Layers

Pooling layers are often found in convolutional networks. The intuition behind these layers is not well understood, but they are commonly used as they seem to improve performance.

The idea is to slide a filter (or kernel) over the image, the same way we did for convolutional layers. That is, a pooling layer has a size $f$ and a stride $s$. But instead of computing a convolution at every position of the filter, we apply a single function. In practice, we find max pooling (most common) and average pooling (less common). For max pooling, for instance, at every position of the filter we take the max within that window. We sometimes find average pooling at the end of a network, to compress an image into a single pixel. Interestingly, pooling layers have no parameters to learn (only hyperparameters).

When the input image has multiple channels, we apply the pooling filter to each channel independently. Most commonly, we don’t apply padding with pooling layers.

Also, it is important to realize that padding, in the context of pooling layers (at least in TensorFlow), does not have the same meaning as in the discussion of convolutional layers above. For pooling layers, same padding means that the picture is enlarged just enough so that the kernel covers every input pixel (the output has size $\lceil n/s \rceil$), whereas with valid padding the kernel is only applied where it fully fits. In other words, even with same padding, the output does not necessarily have the same dimensions as the input. For instance,

A = tf.ones((1, 14, 32, 1))
P1 = tf.nn.max_pool(A, ksize=[1,8,8,1], strides=[1,8,8,1], padding='SAME')
P2 = tf.nn.max_pool(A, ksize=[1,8,8,1], strides=[1,8,8,1], padding='VALID')
with tf.Session() as sess:
    print(f"P1.shape = {sess.run(P1).shape}")
    print(f"P2.shape = {sess.run(P2).shape}")

will give

P1.shape = (1, 2, 4, 1)
P2.shape = (1, 1, 4, 1)

This post also explains it well.

Video: CNN example

Introduces an example of a CNN for digit recognition on a 32 x 32 x 3 RGB image. The presented CNN is inspired by LeNet-5.

  • convolutional layer 1: 6 filters with hyperparameters $f=5, s=1, p=0$
  • max pooling layer 1: $f=2, s=2$.
  • convolutional layer 2: 16 filters with hyperparameters $f=5, s=1, p=0$
  • max pooling layer 2: $f=2, s=2$.
  • fully-connected layer 3: flatten output of max pooling layer 2 -> fc3 = 400 x 120
  • fc4: 120 x 84
  • Softmax layer (I guess fc5 + softmax): 84 x 10 (10 digits to recognize)

We notice that throughout the layers, the image size decreases while the number of channels increases. We also notice the alternation conv layer / pool layer repeated a few times, followed by a few fully-connected layers.
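
A sketch of that architecture in Keras (my own reconstruction from the list above; the activation functions are an assumption, they are not specified in these notes):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(6, 5, strides=1, padding='valid', activation='relu', input_shape=(32, 32, 3)),  # -> 28x28x6
    layers.MaxPooling2D(pool_size=2, strides=2),   # -> 14x14x6
    layers.Conv2D(16, 5, strides=1, padding='valid', activation='relu'),  # -> 10x10x16
    layers.MaxPooling2D(pool_size=2, strides=2),   # -> 5x5x16
    layers.Flatten(),                              # -> 400
    layers.Dense(120, activation='relu'),          # fc3
    layers.Dense(84, activation='relu'),           # fc4
    layers.Dense(10, activation='softmax'),        # softmax over the 10 digits
])
model.summary()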

Video: Why convolutions?

What makes convolution layers efficient for image recognition?

  • weight sharing: you process the whole input image with a limited number of parameters (see the parameter-count comparison after this list)
  • sparsity of connections: each output pixel only depends on a few of the input pixels
  • Conv layers are believed to be good at preserving translation invariance.
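
For instance, mapping a 32 x 32 x 3 image to a 28 x 28 x 6 volume with 6 filters of size 5 x 5 x 3 costs $(5 \times 5 \times 3 + 1) \times 6 = 456$ parameters, whereas a fully-connected layer between volumes of the same sizes would need $3072 \times 4704 \approx 14$M weights.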

Assignments

I put my assignments on github.

Additional references

This blog post has very nice visualizations of what convolutional layers do.

Supplementary note

When you have a dilated convolution (dilation $d$), the side length of the output becomes \(\left\lfloor \frac{n+2p-d(f-1)-1}{s} \right\rfloor + 1\).
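
For example, with $n=32$, $f=3$, $d=2$, $p=0$ and $s=1$, we get $\lfloor (32 - 2 \times 2 - 1)/1 \rfloor + 1 = 28$, the same output size as a regular convolution with a 5 x 5 kernel, which is the effective receptive field of the dilated 3 x 3 kernel.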

First steps with rpy2

How to install in Docker

I haven’t found a way to install it via pipenv, so I added the following lines directly to my Dockerfile

RUN apt-get update && apt-get install -y r-base
RUN pip install rpy2

Yes, don’t forget to install R before installing rpy2.

How to install and load an R package

Default packages come with your R installation, so you can directly load them by doing

import rpy2.robjects.packages as rpackages

my_r_package_loaded = rpackages.importr("my_r_package")

You can then use your loaded R package like a module. For instance,

my_r_package_loaded.this_amazing_script(var1=1.0, var2='hello', var3=3.4)

But for specific packages, like the TOSTER package here, you first need to install them, which you can do like this:

from rpy2 import robjects
import rpy2.robjects.packages as rpackages

utils = rpackages.importr('utils')
utils.chooseCRANmirror(ind=1)
utils.install_packages("TOSTER")
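
Once installed, you should then be able to load it like any other package (assuming the install succeeded):

toster = rpackages.importr("TOSTER")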

How to run an R script in a notebook

Assuming the script is in the same directory as your notebook, you can do

import rpy2.robjects as robjects
r = robjects.r
r.source('my_first_r_script.r')

How to plot from R using rpy2

import rpy2.robjects as robjects
from rpy2.robjects.lib import grdevices
from IPython.display import Image, display

with grdevices.render_to_bytesio(grdevices.jpeg, width=1024, height=896, res=150) as img:
    robjects.r.source('my_first_r_plot.r')
    
display(Image(data=img.getvalue(), format='jpeg', embed=True))

More on the plotting can be found in this demo notebook

Fixing jekyll

After upgrading to Catalina, I could not run jekyll locally anymore to render this website on my laptop. I was getting a weird error message about the ruby interpreter being bad.

The solution is to use a different ruby interpreter, installed via Homebrew: brew install ruby. Then update your PATH so that the command line defaults to this newly installed interpreter:

export PATH="/Users/bencrestel/local/homebrew/opt/ruby/bin:$PATH"

Once this is done, we can install the jekyll gem (and bundler, but I’m not sure what that is): gem install --user-install bundler jekyll. Next, you add that to your PATH

export PATH="/Users/bencrestel/.gem/ruby/2.7.0/bin":$PATH

so that the right version of jekyll will be used. And you’re good to go.

Ref: All the info was in that post.

Signing commits in github

My company asked us to start signing our commits. Let’s see how we can do that.

Github put together a page to explain the different steps.

On my local machine

Create a GPG key

The first step is to add a GPG key to your github account. You can check if you already have one by going to Settings > SSH and GPG keys. If the field for GPG keys is empty, then you need to create one. To do so, you need gnupg. On Mac OSX, you can simply install the command-line tool by typing brew install gpg.

When this is installed, you can create a GPG key by typing gpg --full-generate-key. You’ll be prompted with a series of questions, and after that you get a key.

Exporting your key to github

Now you need to export your public key. To do so, find the info of your key by typing

gpg --list-secret-keys --keyid-format LONG

On the sec row, you’ll see rsa4096 if you chose a 4,096-bit RSA key, and right after that is your key ID. You can then export that key by typing

gpg --armor --export <key_ID> > .gpg/public.key

This will save your public key to a file .gpg/public.key. You can share that file with whoever needs it.

In the case of github, you’ll copy the content of that file into the box provided when you want to add a new GPG key.

Sign your commits

Now that this is done, you add your GPG key to your local git config by doing git config --global user.signingkey <key_ID>. You can then sign your commits by adding the -S argument when doing git commit... Well, that is if we didn’t break anything. It turns out that Homebrew upgraded itself in the process of installing gpg, and it broke all my symlinks for python3. It actually installed 3.9, which has its unversioned symlinks placed in a different folder. So I had to fix all that, and re-install our library.

Now that this works, I still had a few things to do before being able to sign my commits. The first, mandatory step is to set the following environment variable: export GPG_TTY=$(tty). Otherwise, you will never get a prompt for the gpg passphrase and your signed commit will fail. Once this is done, you can sign your commit with the -S argument, then enter your GPG passphrase. You can check that a commit was signed by doing

git verify-commit <commit_hash>

If you want to automate all that, just enter the following 2 lines:

git config gpg.program gpg
git config commit.gpgsign true

and you can commit the same way you were doing before (without the -S argument).

Save passphrase of the GPG key

Now I still had to enter my passphrase when signing commits, which can quickly become annoying. So I followed the steps highlighted here. At a high level, you need to brew install pinentry-mac, then create 2 files: ~/.gnupg/gpg-agent.conf, in which you add

# Connects gpg-agent to the OSX keychain via the brew-installed
# pinentry program from GPGtools. This is the OSX 'magic sauce',
# allowing the gpg key's passphrase to be stored in the login
# keychain, enabling automatic key signing.
pinentry-program /usr/local/bin/pinentry-mac

then ~/.gnupg/gpg.conf where you add

use-agent

On a remote server

Obviously, you don’t need to re-create the key. You can simply modify the global .gitconfig file to add user.name, user.email, and user.signingkey.
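
For instance, something like this (the values are placeholders):

git config --global user.name "Your Name"
git config --global user.email "you@example.com"
git config --global user.signingkey <key_ID>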

Then, you also need to add your private key to gpg on the remote server. To do that, you first need to export your private key from your local machine,

gpg --armor --export-secret-keys <key_ID> > ~/.gpg/private.key

Then on the remote server, you add that private key by doing

gpg --import ~/.../private.key

You will be prompted for your GPG passphrase, and that’s it. You can check that you added the GPG key successfully by doing

gpg --list-secret-keys --keyid-format LONG

After that, you can sign your commits with the -S option

git commit -S

Note that I didn’t have to set the GPG_TTY environment variable on the remote server (even though it was not defined there).

You can automate the signing of commits the way you would do locally.

It looks like we could use pinentry on a Unix server, but I haven’t tried it yet.

References

This post is pretty good.

Ideal Calibration

In the paper Accurate Uncertainties for Deep Learning Using Calibrated Regression, the authors detail what an ideal calibration looks like. For background, a calibration $R$ takes a distributional forecast $H(x_t)$ (denoted $F_x$ below) and returns a better approximation to the true cdf $F_Y$. First of all, an ideal forecast would possess the property that \[ F_x(y) = F_Y(y) = \mathbb{P}[Y \leq y ] , \] or equivalently, writing $y = F_x^{-1}(p)$ with $p\in[0,1]$, \[ p = \mathbb{P}[Y \leq F_x^{-1}(p)]. \]

Now a calibration is a function $R: [0,1] \rightarrow [0,1]$ applied to the output of the distributional model. Let’s see what $R$ needs to be for the recalibrated forecast $R \circ F_x$ to satisfy the ideal property above, i.e., $\mathbb{P}[Y \leq (R \circ F_x)^{-1}(p)] = p$ for all $p$. Since $(R \circ F_x)^{-1} = F_x^{-1} \circ R^{-1}$, we have \[ \mathbb{P}[Y \leq (R \circ F_x)^{-1}(p) ] = \mathbb{P}[Y \leq F_x^{-1}(R^{-1}(p))]. \] If we define $R$ as \[ R(p) = \mathbb{P}[Y \leq F_x^{-1}(p)] \] then we have \[ \mathbb{P}[Y \leq (R \circ F_x)^{-1}(p) ] = \mathbb{P}[Y \leq F_x^{-1}(R^{-1}(p))] = R(R^{-1}(p))=p. \] In some sense, $R$ gives the corrected quantile level for $Y$ when approximated by the cdf $F_x$.
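
A minimal numpy sketch of how one might estimate $R$ from held-out data, assuming $F_x$ is continuous and monotone so that $\mathbb{P}[Y \leq F_x^{-1}(p)] = \mathbb{P}[F_x(Y) \leq p]$ (the paper itself fits this map with isotonic regression):

import numpy as np

def empirical_calibration_map(pit_values):
    # pit_values[t] = F_{x_t}(y_t): the forecast cdf evaluated at the observed y_t
    pit = np.sort(np.asarray(pit_values))
    def R(p):
        # empirical estimate of P[F_x(Y) <= p] = P[Y <= F_x^{-1}(p)]
        return np.searchsorted(pit, p, side="right") / len(pit)
    return R

# toy example: overconfident forecasts, so the PIT values pile up near 0 and 1
pit = np.concatenate([np.random.uniform(0, 0.2, 50), np.random.uniform(0.8, 1.0, 50)])
R = empirical_calibration_map(pit)
print(R(0.1), R(0.9))  # roughly 0.25 and 0.75 here: the nominal quantile levels get corrected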