Bro, do you even backprop?

The goal of this post is to dive deeper into the derivation of the backpropagation algorithm. This entry can be seen as a sub-post of that main post discussing a Python implementation of an artificial neural network, and the notation is exactly the same. For conventions and definitions about vector/matrix calculus, see that post.

The loss function is defined as the average of individual contributions, one per data point $i$,

\[\mathcal{L}(\{ {\bf W}^l, {\bf b}^l \}_l) = \frac1N \sum_{i=1}^N c(\{ {\bf W}^l, {\bf b}^l \}_l, {\bf x}_i, {\bf y}_i)\]

For examples of loss functions, have a look at that post. In this post, I’m going to focus only on how to calculate the derivatives \(\frac{\partial c({\bf x}_i)}{\partial {\bf b}^l}\) and \(\frac{\partial c({\bf x}_i)}{\partial {\bf W}^l}\).

Derivatives with respect to ${\bf b}^l$

Applying the chain-rule, we get that

\[\frac{\partial c({\bf x}_i)}{\partial {\bf b}^l} = \frac{\partial {\bf a}^L}{\partial {\bf b}^l} \cdotp \frac{\partial c({\bf x}_i)}{\partial {\bf a}^L}\]

For the output layer, i.e., $l=L$, since ${\bf a}^L = f({\bf z}^L)$ with ${\bf z}^L = {\bf W}^L \cdotp {\bf a}^{L-1} + {\bf b}^L$, applying the chain-rule again through $z^L_k$, we have

\[\frac{\partial {\bf a}^L}{\partial b^L_j} = \sum_{k=1}^{n_L} \frac{\partial {\bf a}^L}{\partial z_k^L} \frac{\partial z_k^L}{\partial b^L_j}\]

Let ${\bf e}_k = [\dots,0,1,0,\dots]^T$ be the vector filled with all zeros but for one 1 at the $k^\text{th}$ entry. Then, we have

\[\frac{\partial {\bf a}^L}{\partial z_k^L} = f'(z_k^L) {\bf e}_k, \quad \frac{\partial z_k^L}{\partial b^L_j} = \delta_{jk}\]

This means

\[\frac{\partial {\bf a}^L}{\partial b^L_j} = f'(z_j^L) {\bf e}_j\]

Following the convention defined here, we’ll stack these vectors row by row to get

\[\frac{\partial {\bf a}^L({\bf x}_i)}{\partial {\bf b}^L} = \begin{bmatrix} f'( z_1^L({\bf x}_i) ) & & 0 \\ & \ddots & \\ 0 & & f'(z_{n_L}^L({\bf x}_i)) \end{bmatrix}\]

In the end, with \(z_k^L({\bf x}_i) = ({\bf w}^L_k)^T \cdotp {\bf a}^{L-1}({\bf x}_i) + b^L_k\), we get

\[\frac{\partial c({\bf x}_i)}{\partial {\bf b}^L} = \begin{bmatrix} f'( z_1^L({\bf x}_i) ) & & 0 \\ & \ddots & \\ 0 & & f'(z_{n_L}^L({\bf x}_i)) \end{bmatrix} \cdotp \frac{\partial c({\bf x}_i)}{\partial {\bf a}^L}\]

Using the results in the post on the loss function, we can show actual examples. For a least-squares loss, we have \(\frac{\partial c({\bf x}_i)}{\partial {\bf a}^L} = {\bf a}^L({\bf x}_i) - {\bf y}_i\). For a multi-class classifier with a softmax output layer and a cross-entropy cost \(c = -\log a^L({\bf x}_i)_{y_i}\) (a case not actually covered by the above formula, since softmax is not an elementwise activation), we would have \(\frac{\partial c({\bf x}_i)}{\partial {\bf a}^L} = -\frac{1}{a^L({\bf x}_i)_{y_i}} {\bf e}_{y_i}\).
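
To make this concrete, here is a minimal NumPy sketch of the output-layer bias gradient for the least-squares case, assuming a sigmoid activation (the function names and the choice of activation are only illustrative, not taken from the main post). Note that the diagonal matrix is never formed explicitly; an elementwise product does the same job.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def grad_b_output(W_L, b_L, a_prev, y):
    """dc/db^L = diag(f'(z^L)) . (a^L - y), as an elementwise product."""
    z_L = W_L @ a_prev + b_L            # z^L = W^L . a^{L-1} + b^L
    a_L = sigmoid(z_L)                  # a^L = f(z^L)
    dc_da = a_L - y                     # least-squares: dc/da^L = a^L - y
    return sigmoid_prime(z_L) * dc_da   # dc/db^L
```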

For an interior layer $l$, we need to write an induction between successive layers. But first, let's make a few observations. In these derivatives, the variables ${\bf z}^l$ and ${\bf b}^l$ are interchangeable. Indeed, since \(\frac{\partial {\bf z}^l}{\partial {\bf b}^l} = I_{n_l}\), we have

\[\frac{\partial \cdotp}{\partial {\bf b}^l} = \frac{\partial {\bf z}^l}{\partial {\bf b}^l} \frac{\partial \cdotp}{\partial {\bf z}^l} = \frac{\partial \cdotp}{\partial {\bf z}^l}\]

Next, the result for \(\frac{\partial {\bf a}^L}{\partial {\bf b}^L}\) derived above can be generalized to any layer $l$, i.e.,

\[\frac{\partial {\bf a}^l({\bf x}_i)}{\partial {\bf b}^l} = \begin{bmatrix} f'( z_1^l({\bf x}_i) ) & & 0 \\ & \ddots & \\ 0 & & f'(z_{n_l}^l({\bf x}_i)) \end{bmatrix}\]

We now apply the chain-rule,

\[\begin{align*} \frac{\partial c({\bf x}_i)}{\partial {\bf b}^l} & = \frac{\partial c({\bf x}_i)}{\partial {\bf z}^l} \\ & = \frac{\partial {\bf a}^l}{\partial {\bf z}^l} \frac{\partial {\bf z}^{l+1}}{\partial {\bf a}^l} \frac{\partial c({\bf x}_i)}{\partial {\bf z}^{l+1}} \\ & = \frac{\partial {\bf a}^l}{\partial {\bf b}^l} \frac{\partial {\bf z}^{l+1}}{\partial {\bf a}^l} \frac{\partial c({\bf x}_i)}{\partial {\bf b}^{l+1}} \end{align*}\]

And in the end,

\[\frac{\partial c({\bf x}_i)}{\partial {\bf b}^l} = \begin{bmatrix} f'( z_1^l({\bf x}_i) ) & & 0 \\ & \ddots & \\ 0 & & f'(z_{n_l}^l({\bf x}_i)) \end{bmatrix} \cdotp ({\bf W}^{l+1})^T \cdotp \frac{\partial c({\bf x}_i)}{\partial {\bf b}^{l+1}}\]
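
As a rough sketch (again with illustrative names, not the code of the main post), this recursion boils down to one line of NumPy, applied layer by layer from $l = L-1$ down to $l = 1$, starting from the output-layer gradient:

```python
import numpy as np

def grad_b_hidden(z_l, W_next, grad_b_next, fprime):
    """dc/db^l = diag(f'(z^l)) . (W^{l+1})^T . dc/db^{l+1}."""
    return fprime(z_l) * (W_next.T @ grad_b_next)
```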

Derivatives with respect to ${\bf W}^l$

I am going to reuse the results of the previous section. Writing ${\bf w}_j^l$ for the $j^\text{th}$ row of ${\bf W}^l$ (using $j$ here to avoid a clash with the data-point index $i$), this is yet another application of the chain-rule,

\[\begin{align*} \frac{\partial c({\bf x}_i)}{\partial {\bf w}_j^l} & = \frac{\partial {\bf z}^l}{\partial {\bf w}_j^l} \frac{\partial c({\bf x}_i)}{\partial {\bf z}^l} \\ & = \begin{bmatrix} \dots & 0 & {\bf a}^{l-1}({\bf x}_i) & 0 & \dots \end{bmatrix} \cdotp \frac{\partial c({\bf x}_i)}{\partial {\bf b}^l} \\ & = {\bf a}^{l-1}({\bf x}_i) \frac{\partial c({\bf x}_i)}{\partial b^l_j} \end{align*}\]

Since \({\bf W}^l = [{\bf w}_1^l,\dots,{\bf w}_{n_l}^l]^T\), we stack all the \(\frac{\partial c({\bf x}_i)}{\partial {\bf w}_j^l}\) in columns and then take the transpose to get

\[\frac{\partial c({\bf x}_i)}{\partial {\bf W}^l} = \frac{\partial c({\bf x}_i)}{\partial {\bf b}^l} \cdotp ({\bf a}^{l-1}({\bf x}_i))^T\]
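
In code, this is just an outer product; a minimal sketch with the same illustrative names as above:

```python
import numpy as np

def grad_W(grad_b_l, a_prev):
    """dc/dW^l = (dc/db^l) . (a^{l-1})^T, of shape (n_l, n_{l-1}) like W^l."""
    return np.outer(grad_b_l, a_prev)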

Comments

Dead neurons

It is informative to investigate what can cause the gradient of the cost function to be zero. Note that the observations below apply to the contribution of a single data point.

(1) This will be the case if the next layer is dead (all of its gradients evaluate to zero), because of the backpropagation recursion. So once you have a dead layer (all neurons in that layer are dead), backpropagation effectively stops there: every earlier layer also receives a zero gradient.

(2) This will also be the case if \(f'(z^l_j) = 0\). With ReLU, this happens for negative values of ${\bf z}^l$; for sigmoid / tanh, $f'$ becomes vanishingly small for values of very large magnitude (either negative or positive). A toy numerical check is given right after point (3).

(3) This happens for neuron $j$ of layer $l$ if the $j^\text{th}$ column of ${\bf W}^{l+1}$ is zero.
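
Here is the toy numerical check of point (2) announced above, with made-up values:

```python
import numpy as np

def relu_prime(z):
    return (z > 0).astype(float)

z_l = np.array([-1.3, 0.7])         # first neuron has a negative pre-activation
upstream = np.array([0.5, -0.2])    # stands in for (W^{l+1})^T . dc/db^{l+1}
print(relu_prime(z_l) * upstream)   # [ 0.  -0.2] -- the first neuron is dead here
```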

Parallelism

The backpropagation algorithm is inherently sequential, which makes it challenging to parallelize beyond the embarrassingly parallel sum over the data points. This is actually an advantage for batch algorithms, which require all $N$ points, over stochastic algorithms like stochastic gradient descent, which compute the derivative for a single data point $i$.

Checkpointing

To compute the derivatives, you need to store all the \({\bf a}^l({\bf x}_i)\) computed during the forward propagation. I wonder whether memory could become an issue; in that case, a checkpointing algorithm would be a natural solution.

Higher-order derivatives

The next step is to look at how to compute the action of the Hessian of the loss function on a given vector.
