Intro to causal inference by Ferenc

Ferenc Huszar put together a series of 3 blog posts to introduce different concepts of causal inference.

He mentions a fourth part dealing with “Causal Diagrams, Markov Factorization, Structural Equation Models”, but I haven’t seen anything related to it.

ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus

In this first post, he mainly introduces the do operator and what it means. An important point is to understand the distinction between $p(y|x)$ and $p(y| do(x))$.

  • $p(y|x)$: it’s the typical conditional distribution we are used to calculating. It comes from an underlying joint distribution (say $p(x,y,z)$ for the sake of argument) and we have the relation $p(y|x) = p(x,y)/p(x)$, where the pdfs in the ratio are marginals of the original joint distribution. But that joint distribution $p(x,y,z)$ corresponds to the data we observed. That is the fundamental difference with the other one.
  • $p(y|do(x))$ is also a standard conditional distribution, albeit calculated from a joint distribution that does not correspond to the data we observed. It is calculated from the joint distribution $p_{do(X=x)}(x,y,z)$ in which we force the random variable $X$ to take the value $x$ by intervention (eg, A/B testing, randomized controlled trial, …). In that case, we have $p(y|do(x)) = p_{do(X=x)}(x,y) / p_{do(X=x)}(x)$. (Note: it would seem to me that we should have $p_{do(X=x)}(x) = 1$; but I’ll double-check.)

In other words, $p(y|x)$ answers the question “how does $y$ change given $x$” based on observational data, while $p(y|do(x))$ answers the same question from an interventional point of view. The main point of causal inference is to try to estimate this $p_{do(X=x)}(x,y,z)$ even without a randomized controlled trial, and this is what the do-calculus is for. Ferenc doesn’t deep-dive into do-calculus, but instead points to a reference paper.

In order to estimate this interventional pdf, one needs a causal model, that is a model that explains how the different features are related to each other (not just relations, but what causes what). Once you have such a model, you can start altering the observed pdf. But, “The mapping from causal diagrams to joint distributions is many-to-one: several causal diagrams are compatible with the same joint distribution. Thus, it is generally impossible to conclusively choose between different causal explanations by looking at observed data only.”

You can test the validity of a causal model against the data you have. And you can extend that idea to do causal discovery, that is, test multiple causal models to find the one that best matches the data. However, you are still blocked by the fact that the relationship between causal models and data is many-to-one. So a causal model is more of a modeling choice.

Illustrating Interventions via a Toy Example

In his second post, Ferenc shows an illustrative example of why the joint distribution doesn’t tell the whole story. He generates a 2d multivariate Gaussian distribution in 3 different ways. They all have the same joint, but the ways they are constructed are different. Therefore, they all have the same conditional distributions, for instance $p(y|X=3)$.

Then he simulates an intervention: instead of merely conditioning on $X=3$, he forces $X=3$ in each data-generating process. But because of the way each example is constructed, they do not lead to the same distribution for $p(y|do(X=3))$. The first conclusion is that you cannot predict the result of an intervention from the joint distribution alone, as all 3 examples had the same joint.

The way to make a correct prediction for this intervention is to use the causal diagram of each example. “Graphically, to simulate the effect of an intervention, you mutilate the graph by removing all edges that point into the variable on which the intervention is applied, in this case x.” After doing this, you see that some causal diagrams are unmodified (therefore the interventional distribution is the same as the original conditional distribution), whereas in other examples both variables become independent (and therefore the interventional distribution becomes the original marginal). “This is called causal inference from observational data.”
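
Here is a minimal numpy sketch of this kind of construction (my own toy version, not Ferenc’s exact code): two data-generating processes that share the same joint Gaussian over $(x, y)$ but respond differently to $do(X=3)$.

```python
# Two data-generating processes with the same observational joint over (x, y)
# but different interventional distributions p(y | do(X = 3)).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Process A: x causes y.
x_a = rng.normal(0, 1, n)
y_a = x_a + rng.normal(0, 1, n)

# Process B: y causes x, tuned so that (x, y) has the same joint as in A.
y_b = rng.normal(0, np.sqrt(2), n)
x_b = 0.5 * y_b + rng.normal(0, np.sqrt(0.5), n)

# Same observational joint (compare covariance matrices)...
print(np.cov(x_a, y_a))   # ~ [[1, 1], [1, 2]]
print(np.cov(x_b, y_b))   # ~ [[1, 1], [1, 2]]

# ...but different interventional distributions p(y | do(X = 3)):
# in A we overwrite x and push it through the mechanism for y,
# in B forcing x leaves y untouched (there is no edge x -> y).
y_do_a = 3.0 + rng.normal(0, 1, n)    # mean ~ 3
y_do_b = y_b                          # mean ~ 0
print(y_do_a.mean(), y_do_b.mean())
```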

Counterfactuals

This post discusses counterfactuals, and mainly how they differ from the interventional conditionals, like $p(y|do(X=\hat{x}))$, that were introduced in the previous posts. The main difference is that interventional conditionals deal with the whole population (they are averages of conditional probabilities), whereas counterfactuals deal with a single individual. Ferenc uses the example “does having a beard have anything to do with having a PhD?”, i.e., would he have gotten his PhD if he didn’t have a beard? In that case, the interventional conditional looks at the distribution of PhDs if we had shaved everyone who had a beard in the universe. Since the two (having a beard and having a PhD) are actually not causally related, we would end up with the marginal of PhD holders in the world. On the other hand, if we care about Ferenc in particular, this is not the average we want. We want to know how shaving that one specific individual would have affected their chances of gaining a PhD. To illustrate that, he introduces Structural Equation Models (SEMs), which allow us to write things more rigorously.

SEMs are a set of equations that represent and quantify the causal dependencies between the variables in the problem. That is, an SEM defines the causal relations (like a causal graph), but also quantifies these relations (ie, it defines the equation to calculate one quantity from the others). So from an SEM you can deduce the causal graph, but the SEM also tells you how the variables are related to each other (not just which variables are related to each other). Since you have some uncertainty, noise random variables $\varepsilon_i$ also enter the SEM. For instance, assuming $u$ is a deterministic input, you could have variables $x$ and $y$ defined as \(x = f_1(u, \varepsilon_1), y = f_2(x, u, \varepsilon_2), \dots\)

When you simulate an intervention in a causal graph, you remove all edges pointing toward the variable you intervene on. In the case of SEMs, this means you remove the single equation defining that variable and instead set that variable to a constant, deterministic value. The rest of the graph/equations remains unchanged. This also means that the realizations of the noise $\varepsilon_i$ (not just the random variables, but the actual realization of the noise for each equation) remain unchanged. However, the variables obtained from the mutilated graph are not exactly the same as the ones from the original graph. Ferenc describes them as a sort of twin living in a parallel universe. The reason he says that is because we mix components of the “real world” (all the equations except the one corresponding to the intervention; the realizations of the noise; …) with components that did not happen (ie, the intervention). He denotes these variables with an asterisk, eg \(y^*\). One key sentence in the article is that “counterfactuals are making a prediction about features of the unobserved twin datapoint based on features of the observed datapoint.”
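
A toy sketch of that twin idea (my own example, not from the post), for a simple additive-noise SEM where the noise realization can be recovered exactly from the observation:

```python
# Toy SEM: x = eps1, y = 2*x + eps2. Given an observed (x, y), we recover the
# noise realizations (abduction), mutilate the graph with do(x = x_star), and
# recompute y* with the SAME noise realization (the "parallel twin").
def counterfactual_y(x_obs, y_obs, x_star):
    eps2 = y_obs - 2 * x_obs   # abduction: noise consistent with the observation
    y_star = 2 * x_star + eps2  # action + prediction on the mutilated graph
    return y_star

print(counterfactual_y(x_obs=1.0, y_obs=3.5, x_star=0.0))  # eps2 = 1.5 -> y* = 1.5
```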

Now that this is out of the way, we can write down clear equations for what a counterfactual is. In the blog post example of the PhD ($y$) and the beard ($x$), the counterfactual can be written as \(p(y^*=1 | do(x=0), y=1, x=1)\). In the blog post, he actually writes this as \(p(y^*=1 | x^*=0, y=1, x=1)\). What’s important to understand is that we have 2 joint distributions:

  • $p(x,y)$ from the data, and
  • \(p(x^*, y^*)\) from the intervention. Actually, this goes even further: because both distributions rely on the same realizations of the noise, they actually come from the same joint distribution \(p(x,y,x^*,y^*)\). This is the reason why we can compute counterfactuals in the first place!

Now that we have this large joint distribution, we can make the connection between interventional conditionals and counterfactuals: \(p(y|do(X=\hat{x})) = p(y^*|x^*)= \int_{x,y} p(y^* | x^*, x, y) p(x,y) dxdy,\) so the interventional conditional is the expectation of the counterfactuals taken over the observed data.

Sequence Models by deeplearning.ai

Week 1

Video: Notation

\(x^{<t>}\): \(t^\text{th}\) entry in sequence $x$

$T_x$: length of sequence $x$

\(x^{(i)<t>}\): entry $t$ in training example $i$.

Vocabulary, Dictionary: ordered list of all words used (10,000-1M). Then each word in that vocabulary can be encoded via one-hot encoding. For unknown words, you can have an “unknown” token.

Video: RNN Model

Activation which is passed from one step to the next is called $a$. Typically set \(a^{<0>}=0\), then \(a^{<t>} = g_1(W_{aa} a^{<t-1>} + W_{ax} x^{<t>} + b_a)\). Activation function typically tanh/ReLU. Notation often condensed by stacking the matrices such that $W_a = [W_{aa}, W_{ax}]$.

Output of RNN layer at step t is called $y$. \(y^{<t>} = g_2(W_{ya} a^{<t>} + b_y)\). Activation function typically sigmoid.

An important detail is that the weights $W_a, W_y$ are shared across all time steps.
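
A minimal numpy sketch of that forward recurrence (dimensions and initialization are arbitrary choices):

```python
import numpy as np

# Minimal sketch of an RNN forward pass; sizes are arbitrary for illustration.
n_a, n_x, n_y, T = 16, 8, 4, 10
rng = np.random.default_rng(0)
Waa = rng.normal(size=(n_a, n_a)) * 0.1
Wax = rng.normal(size=(n_a, n_x)) * 0.1
Wya = rng.normal(size=(n_y, n_a)) * 0.1
ba, by = np.zeros(n_a), np.zeros(n_y)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a = np.zeros(n_a)                     # a^{<0>} = 0
xs = rng.normal(size=(T, n_x))        # one input vector per time step
ys = []
for x_t in xs:                        # the same weights are reused at every step
    a = np.tanh(Waa @ a + Wax @ x_t + ba)
    ys.append(sigmoid(Wya @ a + by))  # y^{<t>} = g2(Wya a^{<t>} + by)
```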

Video: Backpropagation through time

Doesn’t give much detail, simply that the loss is the sum of an elementary loss at each time step, something that can be denoted by \(L(y, \hat{y}) = \sum_{t=1}^{T_y} L(y^{<t>}, \hat{y}^{<t>})\). When you back-propagate, you back-propagate through time, ie, “from right to left”. Again, keep in mind that the weights are shared across all time steps.

Video: Different types of RNNs

  • one-to-one: not really an RNN; it’s just a fc layer
  • one-to-many: only have an input for the first step. After that first step, \(x^{<t>}\) is replaced by the output generated in the previous step \(y^{<t-1>}\) (like when DeepAR is predicting). This is the case of music generation, for example.
  • many-to-one: you have multiple inputs, but only one output you care about. In that case, you only care about the last output. This is the case of sentiment analysis, for example.
  • many-to-many ($T_x=T_y$): standard case where each input gives an output, at each step. There is a potential limitation that the output only depends on the previous steps; but this can be remedied with bidirectional RNNs (or LSTMs).
  • many-to-many ($T_x \neq T_y$): in that case, you want to use an encoder-decoder architecture. That is, you first pass all of the inputs ($T_x$), without using the outputs. Then after passing the input, you start generating the output (most likely feeding each output as an input at the next step).

This list does not include attention, which is covered in week 3.

Video: Language model and sequence generation

A language model returns the likelihood of a sentence; this can be used, for instance, to decide what was most likely said (by comparing the likelihoods of 2 sentences).

To do that, you can formulate it as a many-to-many RNN, where the output of the RNN is passed through a softmax and is trained to predict the probability of the next word, given the first few words in the sentence, \(P(y^{<t>} | x^{<1>},...)\). Because you now have a lag, you start with the first input being 0, ie, \(x^{<0>}=0\).

You define the loss at each time step to be \(-\sum_i y_i \log(\hat{y}_i)\) (cross-entropy loss), with $y_i$ being non-zero (ie, 1) only for the true word. Then for the overall loss you sum over all time steps. That sum is equivalent to the negative log-likelihood of the sentence, as you recursively condition on the previous words in the sentence. That is \(\log P(x_1, x_2, x_3) = \log P(x_3|x_1,x_2) + \log P(x_2 | x_1) + \log P(x_1)\).

Video: Sampling novel sequences

Sample from a trained language model (see previous video). You sample by recursively sampling a word from your language model, then plugging it back into your model as an input for the following step.

Video: Vanishing gradients with RNNs

Deep neural networks are hard to train as they lead to vanishing/exploding gradients. RNNs are no different; long sequences make them resemble very deep neural networks. While vanishing gradients can be hard to spot, exploding gradients will lead to extremely large parameters (potentially NaN) and can be remedied by gradient clipping (re-scaling the gradient so that its norm does not exceed a fixed threshold).
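
A sketch of gradient clipping by global norm (the threshold value is an arbitrary choice):

```python
import numpy as np

# Rescale a list of gradient arrays so their global norm stays below max_norm.
def clip_gradients(grads, max_norm=5.0):
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads
```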

Video: Gated Recurrent Unit (GRU)

The main idea is to introduce a memory cell $c$ to try and keep information over a larger number of steps. In the GRU, $c$ directly replaces the activation $a$. It is updated at each step by taking a convex combination of its previous value and a tentative update $\tilde{c}^t$,

\(c^t = \Gamma_u * \tilde{c}^t + (1-\Gamma_u) * c^{t-1}\), where $\Gamma_u$ is a gate value between 0 and 1, and the multiplications are point-wise. It is given by \(\Gamma_u = \sigma(W_u [c^{t-1}, x^t] + b_u)\) where $\sigma$ is the sigmoid function. The tentative update is given by

\(\tilde{c}^t = tanh(W_c [\Gamma_r * c^{t-1}, x^t] + b_c)\), with $\Gamma_r$ a relevance gate \(\Gamma_r = \sigma(W_r [c^{t-1}, x^t] + b_r)\).
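
A sketch of a single GRU step following these equations (shapes are left implicit; the weight matrices act on the concatenation of $c^{t-1}$ and $x^t$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One GRU step; here c is both the memory cell and the activation passed on.
def gru_step(c_prev, x_t, Wu, bu, Wr, br, Wc, bc):
    concat = np.concatenate([c_prev, x_t])
    gamma_u = sigmoid(Wu @ concat + bu)                 # update gate
    gamma_r = sigmoid(Wr @ concat + br)                 # relevance gate
    c_tilde = np.tanh(Wc @ np.concatenate([gamma_r * c_prev, x_t]) + bc)
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev  # convex combination
    return c_t
```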

Video: Long Short Term Memory (LSTM)

It is slightly different from the GRU: it has 3 gates (update, forget, output) instead of 2, and the relevance gate is replaced by directly using the previous output.

\(\Gamma_{u,f,o} = \sigma(W_{u,f,o} [a^{t-1}, x^t] + b_{u,f,o})\). Notice that instead of $c^{t-1}$ in the gates, we now use $a^{t-1}$.

Now the update becomes

\(c^t = \Gamma_u * \tilde{c}^t + \Gamma_f * c^{t-1}\). The $\tilde{c}$ is calculated in terms of $a^{t-1}$ instead of $c^{t-1}$,

\(\tilde{c}^t = tanh(W_c [a^{t-1}, x^t] + b_c)\). And the output is

\(a^t = \Gamma_o * tanh(c^t)\).
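
A sketch of a single LSTM step with these three gates (note that the gates and the candidate cell all use $a^{t-1}$, not $c^{t-1}$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM step returning the new activation and cell state.
def lstm_step(a_prev, c_prev, x_t, Wu, bu, Wf, bf, Wo, bo, Wc, bc):
    concat = np.concatenate([a_prev, x_t])
    gamma_u = sigmoid(Wu @ concat + bu)     # update gate
    gamma_f = sigmoid(Wf @ concat + bf)     # forget gate
    gamma_o = sigmoid(Wo @ concat + bo)     # output gate
    c_tilde = np.tanh(Wc @ concat + bc)
    c_t = gamma_u * c_tilde + gamma_f * c_prev
    a_t = gamma_o * np.tanh(c_t)
    return a_t, c_t
```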

A really good reference for GRU and LSTM is Colah’s blog post on LSTM. There, he uses the more typical notation for LSTM and calls $h^t$ the output (instead of $a^t$).

Video: Bidirectional RNN

Unroll input forward and backward (not backprop though). Then concatenate output from both directions and pass to linear layer to generate output of each step.

Can also have bidirectional LSTMs or GRUs. Actually, the BLSTM is very popular for many NLP tasks.

Disadvantage: you need entire sequence before making a prediction. This can be a problem for certain applications (eg, speech recognition).

Video: Deep RNNs

You can stack layers of RNNs/LSTMs/GRUs on top of each other. For each layer, the input stacks its activation from the previous step with the output from the layer below at that same time step.

For RNNs, 3 layers is already quite a lot; so RNN networks are not very deep. But sometimes you find a stack of RNN layers with a deep fc network sitting on top.

Week 2

Video: Word Representation

One-hot encoding does not give any indication about the similarity between 2 words. They are all at the same distance from each other.

Another way to represent words is to use an embedding, ie, a lower-dimensional vector representation, where words with similar meaning are (hopefully) close to each other.

Also mentions t-SNE, which is a way to visualize very high-dimensional data (eg, word embeddings) in 2D.

Video: Using Word Embeddings

Transfer learning and word embeddings:

  1. Train word embeddings from large corpus online (1B-100B words); or even use pre-trained word embedding
  2. Transfer the embedding to a new task with a smaller training set (~100K words). For instance, use the word embeddings as input for a named entity recognition algorithm -> the word embeddings should simplify the task, as they recognize similar words and present them as similar inputs to your model.
  3. (Optional) fine-tune word embedding with new data.

Transfer Learning typically useful when you have lots of data for task 1, but limited data for task 2.

Relation to face encoding: encoding (as in face encoding) ~ embedding (as in word embedding).

Video: Properties of word embeddings

Word embedding can learn analogies between words (eg, man->king, woman-> queen).

Note that t-SNE is a non-linear mapping (or rather, not an isometry), so the geometrical properties found in high dimensions are not preserved by t-SNE.

How to measure similarity between vectors?

  • Cosine similarity (inner product divided by product of norms)
  • squared norm of the difference between 2 vectors

Video: Embedding matrix

Can store all your embeddings in a matrix of dimension “embedding dim” (rows) x “number of words in the vocabulary” (columns). So if you right-multiply this embedding matrix by a one-hot vector, you recover the embedding for that word.
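
A small numpy sketch of that observation (sizes and the word index are made up): multiplying by a one-hot vector just selects a column, which in practice is implemented as a direct lookup.

```python
import numpy as np

d, N = 50, 10_000                   # embedding dim x vocabulary size (arbitrary)
E = np.random.default_rng(0).normal(size=(d, N))
one_hot = np.zeros(N)
one_hot[4287] = 1.0                 # hypothetical index of some word

# The matrix product and the column lookup give the same vector,
# but the lookup avoids the O(d*N) multiplication.
assert np.allclose(E @ one_hot, E[:, 4287])
```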

Video: Learning word embeddings

Algorithms for learning word embeddings have been simplified over time. But it’s informative to look at the earlier, more complex, algorithms.

You can learn word embeddings by building a neural language model. In A neural probabilistic language model, the authors build a simple neural network that takes a sentence, converts each word to its embedding, passes that through a linear layer, then a softmax to predict the next word. You can train the entire network (weights and embeddings) on all the text data you have, using SGD/Adam/… Sometimes, people will only use the last few (eg, 4) words to predict the next one (instead of the whole sentence).

Sometimes, people also use words before and after (eg, 4 words before, 4 words after) to predict the word in the middle. Or you could use only the last word. Or even use a single nearby word (the idea of the skip-gram model).

In the end, to build a language model, use the last N words. But if you just want to build a word embedding, then you can use any of the other contexts mentioned here (words around, last word, nearby word, …).

Video: Word2Vec

Introduces the skip-gram model used to build a Word2Vec representation. This blog post provides a nice introduction. The basic model is very simple: it is basically the succession of 2 matrix multiplications. The first one, by a matrix of size $\mathbb{R}^{N \times d}$, converts a one-hot representation of a word (vocabulary of dimension N) into a word embedding (dim d). The second one does the opposite and has dimension $\mathbb{R}^{d \times N}$. That last layer is followed by a softmax, $e^{x_i}/\sum_j e^{x_j}$, for each potential output word $i$. The loss is the typical cross-entropy loss, that is $-\sum_k y_k \log(\hat{y}_k)$, where $y \in \mathbb{R}^N$ is zero everywhere except for a single entry.

For any sentence, the input (context) is a random word, and the quantity we are trying to predict is another word in that same sentence, within a certain window around the input word. That output word is chosen at random. In the end, it doesn’t really matter, as we’re not trying to predict that word; we are just trying to build an embedding.

The problem with that approach is that it is extremely slow, due to the need to sum over all $N$ outputs of the softmax for every prediction. A workaround is to use a hierarchical softmax. But better methods have been designed and will be presented in the next videos.

Original paper: Efficient Estimation of Word Representations in Vector Space

Video: Negative Sampling

Reference paper: Distributed Representations of Words and Phrases and their Compositionality

That previous blog post also covers negative sampling.

Actually, the blog post and the video kind of differ about the exact details of the implementation. So we can also look at Manning’s notes.

In the end, the idea is to not look at all $N$ words in the vocabulary, but only at the one true word that matches the context and a small number (5-20) of words that are not correct. You therefore turn the problem into a small number of binary classifications, and you can hence use a sigmoid loss to force your network to return 1 for the true word and 0 for the other words. The actual details seem to be that the sigmoid is applied to the inner product between the embeddings of both words. Words that are close in a sentence are assumed to be similar.
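
A sketch of that objective for a single (context, target) pair with a handful of sampled negative words; the embedding vectors are assumed to be given:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Negative-sampling loss for one pair: push sigmoid(dot) towards 1 for the true
# (context, target) pair and towards 0 for each sampled negative word.
def negative_sampling_loss(e_context, e_true, e_negatives):
    loss = -np.log(sigmoid(e_true @ e_context))                        # true -> 1
    loss += -sum(np.log(sigmoid(-e_n @ e_context)) for e_n in e_negatives)  # negatives -> 0
    return loss
```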

Video: GloVe word vectors

GloVe = Global Vectors for Word Representations

GloVe builds from the co-occurrence matrix $X_{ij}$, which counts how often word $i$ falls within the vicinity of word $j$. Then we train the word embeddings so that the inner product between 2 word vectors matches (more or less) the logarithm of their co-occurrence count. The match is enforced by a squared loss, and a weighting term is placed in front, which also takes care of cancelling the loss when $X_{ij}=0$.
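
A sketch of that objective (the weighting function and its hyperparameters are the common choices from the GloVe paper, used here as assumptions; `theta` and `e` are the two sets of word vectors, `b` and `b_prime` their biases):

```python
import numpy as np

# Weighted least-squares between embedding dot products and log co-occurrences.
def glove_loss(X, theta, e, b, b_prime, x_max=100.0, alpha=0.75):
    f = np.minimum((X / x_max) ** alpha, 1.0)        # weighting term, f(0) = 0
    with np.errstate(divide="ignore"):
        log_X = np.where(X > 0, np.log(X), 0.0)      # value irrelevant where f = 0
    err = theta @ e.T + b[:, None] + b_prime[None, :] - log_X
    return np.sum(f * err ** 2)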

Video: Sentiment Classification

The typical situation is that you don’t have that many reviews to train your sentiment classifier on. But you were able to use a very large dataset to train your word embedding. So you can use the embeddings of the words in your reviews (instead of the words directly) to learn a more robust sentiment classifier with a small dataset.

Also, you could just average the embeddings of all the words in your sentence, then pass the result through a softmax layer (one output for each of the possible ratings). But this doesn’t take word order into account. A better way is to pass the embeddings of each word to an LSTM and output a classification at the very end (many-to-one).

Video: Debiasing word embeddings

Bias is likely to creep into the dataset used to train a Word2Vec. So it’s important to be aware of it, and if possible remove that bias.

One way this can be done (a rough sketch follows the list):

  1. Identify the bias direction: for instance, for gender bias, take all the gender difference directions (boy - girl, king - queen, …), and take the average
  2. Neutralize: for all words that should be free of gender bias, project them onto the subspace perpendicular to the gender direction
  3. Equalize: some words are associated with gender, but they should be at the same distance from gender-neutral words (eg, nurse shouldn’t be closer to woman than to man). So you can also fix that
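
A rough sketch of the neutralize and (simplified) equalize steps, assuming the bias direction `g` has already been estimated as in step 1; the equalize step here is a simplification of the full recipe:

```python
import numpy as np

def neutralize(e, g):
    # Remove the component of embedding e along the bias direction g.
    return e - (e @ g) / (g @ g) * g

def equalize(e1, e2, g):
    # Simplified equalization: make the pair symmetric about the bias-free axis,
    # so both words end up equidistant from any neutralized word.
    mu = (e1 + e2) / 2.0
    mu_orth = neutralize(mu, g)
    e1_eq = mu_orth + ((e1 - mu) @ g) / (g @ g) * g
    e2_eq = mu_orth + ((e2 - mu) @ g) / (g @ g) * g
    return e1_eq, e2_eq
```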

Week 3

Video: Basic Models

Introduced 2 types of models:

  • Sequence-to-sequence: this can be used for text translation. It uses an encoder-decoder architecture, which is a many-to-many architecture with $T_x$ and $T_y$ different. So you first pass the inputs to your LSTM without using the outputs generated (encoder), then start generating the $T_y$ outputs by feeding each generated token back into the LSTM (decoder).
  • Image captioning: you can automatically generate captions for an image using a similar idea. You simply replace the encoder from the previous part with a CNN (eg, AlexNet) w/o the output (softmax) layer and feed the resulting embedding of the image into an LSTM, which then produces the caption.

Video: Picking the most likely sentence

In machine translation, the encoder-decoder architecture can be interpreted as a conditional language model. That is, the decoder part looks very much like a language model, but instead of being initialized with a hidden state of zeros, it is initialized with the output of the encoder part.

When generating a translation, you don’t want to sample the joint probability distribution of your decoder at random. You instead want to pick the most likely translation, i.e., $\arg \max P(y_1, y_2, \ldots | x)$. A naive approach is to pick each most likely word successively, ie, $\hat{y}_1 = \arg \max P(y_1|x)$, then $\hat{y}_2 = \arg \max P(y_2 | \hat{y}_1, x)$, … However, this tends to return sub-optimal results. Instead, the correct approach is to search the joint probability space directly. This space, unfortunately, is too large to be searched exhaustively. For instance, imagine you look for the most likely 10-word translation with a vocabulary of 10,000 words: that means $10,000^{10}$ possibilities. That’s why we need to use an approximate search, eg, beam search.

Beam search is an approximate search algorithm for finding the maximum of a joint probability distribution. The key parameter is the beam width, which defines the number of most likely combinations that the beam search algorithm keeps at each step. So for instance, if the beam width is 3, then at each step $k$, beam search only keeps the 3 most likely combinations of words (partial sentences) when moving on to the next step $k+1$, i.e, $(y_1^1, y_2^1,\ldots ,y_k^1), (y_1^2, y_2^2, \ldots, y_k^2), (y_1^3, y_2^3,\ldots, y_k^3)$. At the next step, it only starts from these 3 combinations, then searches for the 3 most likely combinations $(y_1,y_2,\ldots, y_k, y_{k+1})$ among their extensions. So at each step, no matter how deep we are, we only have to consider $3N$ combinations, with $N$ the size of the vocabulary, instead of the number of combinations of an exhaustive search, which compounds with the depth ($N^k$ after $k$ steps).

Notice that when the beam width is set to 1, we recover the greedy algorithm described in the previous video.

Ref: Dive Into Deep Learning

Introduces length normalization. Beam search, as introduced before, tends to favor short sentences, as it maximizes a product of small numbers (less than 1.0). To circumvent that problem, a heuristic is to replace the joint probability by $P^{1/T_y^\alpha}$, where $T_y$ is the length of the sentence and $0 \leq \alpha \leq 1$. Since we typically maximize the log of the joint probability distribution, written as a product of conditional pdfs, we get $\frac{1}{T_y^\alpha} \sum_{t=1}^{T_y} \log P(y^{<t>} | x, y^{<1>}, \ldots, y^{<t-1>})$.

The optimal value needs to be found empirically, as for the beam width. $\alpha = 1$ corresponds to full length normalization, while $\alpha=0$ does not normalize at all. A beam width of $1$ is just greedy search (cheap, but very sub-optimal). On the other hand, a large beam width will yield much better results but at a much higher compute cost.
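
A sketch of beam search with that length normalization; `next_log_probs` is a hypothetical stand-in for the decoder, returning $\log P(y_{\text{next}} | x, \text{prefix})$ for every word in the vocabulary:

```python
import numpy as np

# Approximate search for the most likely sentence with a beam of fixed width.
def beam_search(next_log_probs, vocab, eos, beam_width=3, max_len=20, alpha=0.7):
    beams = [([], 0.0)]                      # (prefix, sum of log-probs)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_p = next_log_probs(prefix)   # shape: (len(vocab),)
            for i, w in enumerate(vocab):
                candidates.append((prefix + [w], score + log_p[i]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            (completed if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    # Pick the hypothesis with the best length-normalized score (1/T^alpha) * sum(log-probs).
    pool = completed or beams
    return max(pool, key=lambda c: c[1] / (len(c[0]) ** alpha))[0]
```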

Now we have a model made of 2 parts: i) the RNN, which estimates the probability of a sentence (conditional language model); and ii) beam search, which searches for the most likely sentence. Using human-translated sentences, you can check which part of your model (RNN or beam search) is at fault for each error, then calculate the fraction of errors attributable to each part. You can then act on the faulty part of your model (modify the model/add more data/…, or increase the beam width).

Video: Bleu score

A pretty terrible video. Probably best to look at the Wikipedia page. Roughly, BLEU computes a modified precision for a machine-generated translation compared to several reference (human-generated) translations. There is also a brevity penalty to penalize short translations.

Video: Attention Model – Intuition

The attention model replaces the standard encoder-decoder architecture. The encoder is still the same. However, the decoder is a different network that at each step takes as input a combination of all the outputs from the encoder, weighted by attention weights. These attention weights tell the decoder what part of the input it should focus its attention on. The weights are computed from the decoder’s previous hidden state and all the outputs of the encoder.

Video: Attention Model

In the classical paper that introduced attention, neural machine translation by jointly learning to align and translate, they use a BRNN for the encoder and RNN for the decoder with attention. At each step of the decoder, the RNN receives the previous state, along with a context that is a linear combination of the outputs of each encoder step.

The weights are all non-negative and sum to 1. To get that, you simply pass intermediate scores through a softmax layer. Each intermediate score comes from an alignment model, which is (typically) a single-layer neural network that takes the previous state of the decoder and the output of that step of the encoder. This alignment model returns a score for each step of the encoder. All these scores are converted to attention weights, which are then used to combine the outputs of the encoder into an input context for that step of the decoder.
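
A sketch of one decoder step’s attention under that description; the alignment model here is a small one-layer network with hypothetical parameters `Wa` and `va`:

```python
import numpy as np

# Score every encoder output against the previous decoder state, softmax the
# scores into weights, and build the context as the weighted sum.
def attention_context(s_prev, encoder_outputs, Wa, va):
    scores = np.array([va @ np.tanh(Wa @ np.concatenate([s_prev, h]))
                       for h in encoder_outputs])   # one score per encoder step
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                        # non-negative, sum to 1
    context = sum(w * h for w, h in zip(weights, encoder_outputs))
    return context, weights
```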

Video: Speech recognition

Speech recognition systems take a waveform of speech and return text. This is typically done with sequence models, at the character level, using the CTC cost (connectionist temporal classification). The idea is to have a “blank” character and to allow repetition of characters. Then you collapse all repeated characters not separated by blanks, then remove the blanks, to end up with the sentence.
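
A sketch of the collapsing rule only (the CTC loss itself is not shown); the blank symbol is an arbitrary choice:

```python
# Merge repeated characters not separated by a blank, then drop the blanks.
def ctc_collapse(chars, blank="_"):
    out = []
    prev = None
    for c in chars:
        if c != prev and c != blank:   # keep the first of each run, skip blanks
            out.append(c)
        prev = c
    return "".join(out)

print(ctc_collapse(list("ttt_h_eee___ _qq__")))  # -> "the q"
```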

Note that speech recognition requires very large datasets (~100,000 h for the best industrial applications)

Video: Trigger Word Detection

Can train network to get triggered by certain words. Could pass waveform through a RNN/LSTM/… and return 0 all the time except when trigger word is pronounced in which case you return 1.

Might have some issues with an imbalanced dataset though.

Notes for CNN course by deeplearning.ai (week 4)

This week focuses on Face Recognition and Neural Style Transfer.

Video: What is face recognition?

Verification vs Recognition? Verification means you have a tuple (image, name) and you must return whether they correspond to the same person or not.

Recognition means you have K persons in your database and given an input image you try to detect whether that image belongs to one of the K persons in your database.

You actually need a very accurate verification system to be able to build a recognition system, as you are basically running verification K times, so you have K more chances to make an error. This is why recognition is typically considered a much harder problem.

Also, face recognition is a one-shot learning problem. You typically have a single example of each face, and you want to train a face recognition algorithm from that.

Video: One Shot Learning

The challenge with face recognition is that you have very few samples (1 per individual). That is a one-shot learning problem.

You cannot solve it as a typical object/face recognition problem. Instead, what you can do is train a network to return a similarity function between a given input image and each of the individuals you have in your database (+ category “unknown”). And you can decide on a threshold under which you declare a match between the input and a person in your db.

Video: Siamese Network

You can define 2 identical networks (same parameters -> Siamese) that each take a face image as input, pass it through a set of conv layers, then fc layers, then output a vector (say, a 128-d vector). Then you can define the similarity between both faces as $| f(x_1) - f(x_2)|^2$

Video: Triplet Loss

The loss is defined from a triplet of images. Using the notion of a margin (as for support vector machines), you define a positive float $\alpha$, and you want $| f(A) - f(P)|^2 - | f(A) - f(N)|^2 + \alpha \leq 0$, where A=anchor, P=positive, N=negative. Actually, since we don’t need this to be extremely negative, we just want it to be less than zero. So to avoid having the network push on a corner case, we take $\max(| f(A) - f(P)|^2 - | f(A) - f(N)|^2 + \alpha, 0)$ as the loss.
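
A sketch of that loss for a batch of encodings (the margin value is an arbitrary choice):

```python
import numpy as np

# Triplet loss over a batch of anchor/positive/negative encodings.
def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    pos = np.sum((f_a - f_p) ** 2, axis=-1)   # ||f(A) - f(P)||^2
    neg = np.sum((f_a - f_n) ** 2, axis=-1)   # ||f(A) - f(N)||^2
    return np.sum(np.maximum(pos - neg + alpha, 0.0))
```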

To train that system, you need multiple faces per person (so not really one-shot…). Then, when selecting triplets, you want triplets such that the 2 norm differences are actually close to each other, so that it forces the network to learn something. That is to say, you don’t pick A, P, N at random.

Video: Face Verification and Binary Classification

Face verification can also be posed as a binary classification problem, by taking the same idea of 2 Siamese networks, then passing both 128-d output vectors to a logistic regression unit (for instance, passing the absolute value of the entry-wise difference between both 128-d vectors). There are also variants.

Notice that you can pre-compute the encodings for all people in your database, so that you only have to compute the encodings for the face you are trying to recognize (at inference time). This will speed up computation time.

Video: What is neural style transfer?

To implement style transfer, you need to understand (and use) the features extracted by the different layers of a convnet. But for that, you need to understand what is being extracted.

Video: What are deep ConvNets learning?

Typically, the early layers of a convnet detect simple, low-level features (edges, colors), while deeper layers detect increasingly complex, higher-level patterns.

Video: Cost Function

Cost function is the sum of a content cost function and a style cost function, both measuring the similarity between the generated image and the content image or style image.

The image is generated by applying gradient descent directly to the generated image, starting from a random image.

Video: Content Cost Function

The content cost is defined as the squared norm of the difference between the activations of the content (input) image and the generated image, at a given layer of a pre-trained ConvNet. The choice of which layer’s activations you use is decided by how much granularity you want to preserve in the generated image.

Video: Style Cost

As for the content cost function, you pick a layer from your ConvNet, then work with the activations of that layer.

The style of an image is defined via its style matrix, which is an $n_c \times n_c$ matrix where each entry is the sum over all spatial positions of the product of the activations of two channels: \(G_{k,k'} = \sum_i \sum_j a_{ijk} a_{ijk'}\). In the video, they call that the correlation between the two channels; that’d be true if the channels all had zero mean.

You calculate the style matrix for the style image and the generated image. The style cost function is defined as the squared Frobenius norm of the difference between both style matrices, up to a normalization constant in front (and there is another multiplicative factor in front of the cost function).
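
A sketch of the style matrix and the style cost at one layer; the activation tensor has shape (n_H, n_W, n_C), and the normalization constant shown is one common choice:

```python
import numpy as np

def gram_matrix(a):
    n_H, n_W, n_C = a.shape
    flat = a.reshape(n_H * n_W, n_C)
    return flat.T @ flat                      # G[k, k'] = sum_ij a_ijk a_ijk'

def style_cost_layer(a_style, a_gen):
    n_H, n_W, n_C = a_style.shape
    G_s, G_g = gram_matrix(a_style), gram_matrix(a_gen)
    norm = (2.0 * n_H * n_W * n_C) ** 2
    return np.sum((G_s - G_g) ** 2) / norm    # normalized squared Frobenius norm
```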

Actually, the overall style cost function is a weighted sum of the style cost function of each individual layer.

The overall cost function is the weighted sum of the content cost function and the style cost function. Then you minimize your cost function (gradient descent, or other optimizer) by changing the generated image.

Video: 1D and 3D Generalizations

Show how convolutional layers can be used on 1d and 3d data, not just 2d data.

Notes for CNN course by deeplearning.ai (week 3)

Video: Object Localization

Distinguish:

  • Classification: just tell what is on the picture (typically a single object per picture)
  • Classification with localization: tell what it is + say where it is (box around object). Typically 1 object/picture
  • Detection: box around all objects in the picture (typically multiple objects per picture)

Classification with localization: in addition to softmax output layer, add another layer for box info: bx, by, bh, bw = coordinates of center of the box (bx, by) and height/width of the box (bh, bw).

Looks at a problem where the object can be: pedestrian, car, motorcycle, or background (no object). Proposed parametrization:

  • pc: 1 if there is an object (classes 1, 2, 3), 0 if no object (background)
  • bx, by, bh, bw
  • c1, c2, c3: 1/0 indicating which class the object belongs to

What is the loss in that case? The loss differs depending on whether there is an object or not. When there is no object (ie, pc=0), the loss is $\text{dist}(y_1, \hat{y}_1)$. If there is an object (pc=1), then the loss is $\sum_i \text{dist}(y_i,\hat{y}_i)$. For the distance, we can use different distances depending on the outputs: for instance, a log-likelihood-like loss (log loss) for c1, c2, c3, a squared error for the box, and a logistic regression loss for pc.

I honestly don’t like that parametrization. I would rather combine pc and c1, c2, c3 and train them with a cross-entropy loss (or softmax + log-likelihood) and use a squared error loss (or whatever) for the box. In the case where there is no object, just set the target box to be the entire picture (centered at the center).

Video: Landmark Detection

Instead of outputting the parameters of a box, you can simply output the coordinates of important parts of an image (called landmarks). Useful in AR, or Snapchat features.

Can modify the output layer to return coordinates of as many landmarks as you want. Problem is you need a training dataset. So someone has to annotate pictures with all these landmarks on it. Also, important for the labels to be consistent across all images (eg, label 1 always for left eye).

Video: Object detection

Object detection means we want to place boxes around potentially multiple objects on an image.

Introduces Sliding Windows Detection algorithm. If we try to detect cars on an image:

  1. Train a ConvNet to detect car from tightly cropped pictures
  2. Then, given an image where you want to detect objects, slide a window across the entire image, and each time you slide it, run the cropped window through your car detector (step 1)
  3. Repeat procedure with larger and larger windows.

If there is a (or multiple) car(s) on the picture, at some point, one of the windows should have isolated that car and the detector should have detected it.

Disadvantage of sliding windows detection: computational cost! It uses many, many windows. You could use coarser strides (when shifting windows), but that will reduce the accuracy of the localization.

That technique worked when we had cheap detectors. With ConvNets, this doesn’t work anymore. But there is a solution (next video).

Video: Convolutional Implementation of Sliding Windows

Reference: OverFeat, by Sermanet et al., 2014.

When running the sliding window algorithm through a ConvNet, there are a lot of repeated calculations. You can leverage this by running the entire image through your ConvNet (instead of each cropped window). The result is a larger output, and each additional value corresponds to running an additional cropped window through the network. With that convolutional approach, the stride of the sliding window is given by the size of the max pooling layer.

Video: Bounding Box Predictions

Sliding windows return bounding boxes that are not very accurate, as the best box may not match the window’s shape and/or none of the sliding window positions may line up with the exact location of the best box.

Solution: the YOLO (You Only Look Once) algorithm. The idea is actually very close to the one introduced for the classification-with-localization problem, except that here you split your image into a grid. Then, for each grid cell, you return the 8-dimensional vector described earlier: pc, bx, by, bh, bw, c1, c2, c3. So you run your image through a ConvNet only once and output a tensor of dimension $N \times N \times 8$, where $N$ is the dimension of your output grid. For each grid cell, you try to match the target 8-d vector. That way, the boxes can be defined much more precisely (no fixed boxes; instead the network infers the center and dimensions of the box). Also, because the image only goes through the network once, the YOLO algorithm is quite fast and can be used for real-time object detection.

The box is encoded relative to each individual grid cell. The position must be between 0 and 1 (sometimes passed through a sigmoid), but the height and width can be greater than 1 (sometimes passed through an exponential to force non-negativity).

Video: Intersection Over Union

How can you evaluate an object localization algorithm? You can use Intersection over Union (IoU), which is a measure of the overlap between 2 boxes: the ratio of the area of their intersection to the area of their union. That ratio is between 0 (the boxes don’t intersect at all) and 1 (the boxes are identical). Often, an IoU greater than 0.5 is considered to mean the box is correct.
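
A sketch of that computation for boxes given by their corner coordinates (x1, y1, x2, y2):

```python
# Intersection over Union between two axis-aligned boxes.
def iou(box_a, box_b):
    xi1, yi1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xi2, yi2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, xi2 - xi1) * max(0.0, yi2 - yi1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```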

Video: Non-max Suppression

Even though each target box belongs to only a single cell in the grid, at prediction time the network is likely to return multiple boxes corresponding to the same object. A simple solution is the non-max suppression algorithm. While there are boxes left, pick the one with the highest pc score (in the case where only a single category is detected) and discard all other boxes that have a high IoU with it (eg, greater than 0.5). Then move on to the box with the next highest score, etc…
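
A sketch of that procedure for a single class, reusing the `iou` helper from the previous sketch; each detection comes with its pc score:

```python
# Non-max suppression: `boxes` is a list of (score, (x1, y1, x2, y2)) tuples.
def non_max_suppression(boxes, iou_threshold=0.5):
    boxes = sorted(boxes, key=lambda b: b[0], reverse=True)
    kept = []
    while boxes:
        best = boxes.pop(0)   # highest remaining pc score
        kept.append(best)
        # Discard remaining boxes that overlap too much with the one we kept.
        boxes = [b for b in boxes if iou(b[1], best[1]) <= iou_threshold]
    return kept
```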

Video: Anchor Boxes

What can you do if multiple objects have their center inside the same grid cell? Even though it doesn’t happen too often (especially with a fine enough grid), it’s good to have a plan for that.

The solution proposed is simply to repeat the output vector for a single object as many times as the number of objects you want to be able to detect per cell. So, for instance, to allow detection of 2 objects per grid cell, the output will be: pc, bx, …, c1, c2, c3, pc, bx, …, c3. How do you decide which object is object 1 and which is object 2?

You can do that using anchor boxes. You set 2 anchor boxes, for instance 2 rectangular boxes, one vertical and one horizontal. In each grid cell, if an object has a shape closer to the vertical box it is assigned to slot 1, and if it is closer to the horizontal box it is assigned to slot 2 (you can use IoU to assign each object to an anchor box). Then you classify each target object according to that rule, and you train your object detection network. This allows the network to specialize each output slot to a type of box.

Video: YOLO Algorithm

The YOLO algorithm puts together all the different components we saw in that week to do fast object detection.

Set your target vectors using the box parametrization we introduced before (pc, bx, …, c1, …) and the idea of anchor boxes. Then you can train your network using that dataset. After running prediction, you post-process the results using the non-max suppression algorithm. And you’re done.

Video: Region Proposals

An alternative idea is to first propose a small number of regions where an object could be, then classify each proposed region. The proposed regions come from a segmentation-type algorithm. Different approaches exist, with the fastest being formulated as a convnet.

How to rename a commit

In all cases, I’m assuming the targeted commit hasn’t been pushed. If it has been pushed, then it’s a different story (and probably a bad idea to do it).

If it’s the latest commit

You can simply amend the latest commit by doing

git commit --amend

then change the commit message and save

If it’s not the latest commit

In that case, you will need to rebase. Assume you want to change the Nth commit from the top (the latest is the first). Then interactively rebase the last N commits,

git rebase -i HEAD~N

This will open a window showing the last N commits. From there, you can do a few things (and a lot of damage, so be careful!). But in our case, you just want to change pick to reword in front of the commits that you want to modify. Then save. It will then open an editor window for each commit you selected in the previous step. For each commit, modify the message the way you want, then save.