The softmax activation function
23 Nov 2018
Notations are defined in that post.
A very popular activation function for classification with a deep net is the softmax, which is defined as
\[{\bf a}^l = \left[ \frac{\exp(a^{l-1}_1)}{\sum_{i=1}^{n_{l-1}} \exp(a^{l-1}_i)}, \dots, \frac{\exp(a^{l-1}_{n_{l-1}})}{\sum_{i=1}^{n_{l-1}} \exp(a^{l-1}_i)} \right]^T\]
That is, ${\bf a}^l = F^l({\bf a}^{l-1})$ with $F^l(x) = [f_1(x), \dots, f_{n_l}(x)]^T$ where
\[f_i(x) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}\]
This type of activation function differs from what was covered in that post for two reasons: (1) it has no parameters to optimize, and (2) each output coordinate $f_i(x)$ depends on all coordinates of $x$. Let’s see how we back-propagate the gradient in that case.
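To make the definition concrete, here is a minimal NumPy sketch of $f_i$. The function name `softmax` and the example vector are illustrative choices, and subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the definition above (it does not change the result).

```python
import numpy as np

def softmax(x):
    """Softmax of a 1-D array x: f_i(x) = exp(x_i) / sum_j exp(x_j)."""
    x = np.asarray(x, dtype=float)
    # Subtracting max(x) leaves the result unchanged (the factor exp(-max)
    # cancels between numerator and denominator) but avoids overflow in exp.
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

a_prev = np.array([1.0, 2.0, 3.0])  # illustrative stand-in for a^{l-1}
a = softmax(a_prev)                 # stand-in for a^l
print(a, a.sum())                   # entries are positive and sum to 1
```

Changing a single entry of `a_prev` changes every entry of `a`, which is point (2) above.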
For the output layer, which is the most common use case for a softmax layer, we need to modify the chain rule to skip the output layer and start instead at layer $L-1$,
\[\frac{\partial c}{\partial {\bf b}^{L-1}} = \frac{\partial {\bf a}^{L-1}}{\partial {\bf b}^{L-1}} \frac{\partial {\bf a}^{L}}{\partial {\bf a}^{L-1}} \frac{\partial c}{\partial {\bf a}^{L}}\]
The only term that was not covered previously is \(\frac{\partial {\bf a}^{L}}{\partial {\bf a}^{L-1}}\), which is simply the gradient of $F^L$.
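As a worked step, this term can be computed directly from the definition of $f_i$ with the quotient rule. The Kronecker delta $\delta_{ij}$ (equal to 1 if $i = j$ and 0 otherwise) and the $\mathrm{diag}$ notation are introduced here for convenience and are assumptions about notation, not taken from the earlier post:
\[\frac{\partial f_i}{\partial x_j}(x) = \frac{\delta_{ij} \exp(x_i)}{\sum_k \exp(x_k)} - \frac{\exp(x_i) \exp(x_j)}{\left(\sum_k \exp(x_k)\right)^2} = f_i(x) \left( \delta_{ij} - f_j(x) \right)\]
Evaluated at $x = {\bf a}^{L-1}$, where $F^L(x) = {\bf a}^L$, this gives
\[\frac{\partial {\bf a}^{L}}{\partial {\bf a}^{L-1}} = \mathrm{diag}({\bf a}^{L}) - {\bf a}^{L} ({\bf a}^{L})^T\]
This matrix is symmetric, so the result is the same whichever layout convention is used for the derivative of a vector with respect to a vector.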
For a hidden layer, which is definitely not the most common case for a softmax activation function but could be the case for another function of this type, we have
\[\begin{align*} \frac{\partial c}{\partial {\bf b}^{l-1}} & = \frac{\partial {\bf a}^{l-1}}{\partial {\bf b}^{l-1}} \frac{\partial {\bf a}^{l}}{\partial {\bf a}^{l-1}} \frac{\partial {\bf z}^{l+1}}{\partial {\bf a}^{l}} \frac{\partial c}{\partial {\bf z}^{l+1}} \\ & = \frac{\partial {\bf a}^{l-1}}{\partial {\bf b}^{l-1}} \frac{\partial {\bf a}^{l}}{\partial {\bf a}^{l-1}} \frac{\partial {\bf z}^{l+1}}{\partial {\bf a}^{l}} \frac{\partial c}{\partial {\bf b}^{l+1}} \end{align*}\]
Again, the only term we didn’t cover before is \(\frac{\partial {\bf a}^{l}}{\partial {\bf a}^{l-1}}\), which is the gradient of $F^l$.
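The following NumPy sketch illustrates this one term for the softmax. It assumes the standard closed form $\partial f_i / \partial x_j = f_i(x)(\delta_{ij} - f_j(x))$, which follows from the definition of $f_i$ by the quotient rule, and checks it against central finite differences. The function names, the random test vectors, and the upstream gradient `g` are hypothetical and only serve the illustration.

```python
import numpy as np

def softmax(x):
    # Shift by max(x) for numerical stability; the result is unchanged.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_jacobian(x):
    """Matrix with entries d f_i / d x_j = f_i * (delta_ij - f_j)."""
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

def backprop_through_softmax(x, grad_out):
    """Multiply the softmax Jacobian at x by grad_out without forming the matrix."""
    s = softmax(x)
    # (diag(s) - s s^T) @ g  ==  s * (g - <s, g>)
    return s * (grad_out - np.dot(s, grad_out))

rng = np.random.default_rng(0)
x = rng.normal(size=4)  # stand-in for a^{l-1}
g = rng.normal(size=4)  # hypothetical gradient of c with respect to a^l

# Central finite differences: column j approximates d f / d x_j.
eps = 1e-6
J_fd = np.column_stack([
    (softmax(x + eps * e_j) - softmax(x - eps * e_j)) / (2 * eps)
    for e_j in np.eye(4)
])

print(np.allclose(softmax_jacobian(x), J_fd, atol=1e-8))
print(np.allclose(softmax_jacobian(x) @ g, backprop_through_softmax(x, g)))
```

Because the matrix is symmetric, the product is the same whether or not it is transposed, and the matrix-free form costs $O(n)$ instead of $O(n^2)$ per back-propagation step.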