Loss functions

This post is a simple summary of some common loss functions and their relation to maximum likelihood estimation (MLE). Given the data \(\{ ({\bf x}_i, {\bf y}_i) \}_{i=1}^N\) and model parameters $\theta$, we define a general loss function as

\[\mathcal{L}(\theta) = \frac1N \sum_{i=1}^N c(\theta; {\bf x}_i, {\bf y}_i)\]
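To make the notation concrete, here is a minimal NumPy sketch of this structure; `per_sample_cost` is a hypothetical placeholder for any of the cost functions $c$ discussed below.

```python
import numpy as np

def loss(theta, X, Y, per_sample_cost):
    """Average a per-sample cost c(theta; x_i, y_i) over the N data points."""
    return np.mean([per_sample_cost(theta, x, y) for x, y in zip(X, Y)])
```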

We also assume that the model makes the predictions \({\bf f}(\theta; {\bf x}_i)\).

Regression

In a regression setting, a very common choice is the least-square loss function, or $L_2$ loss function, i.e.,

\[c(\theta; {\bf x}_i, {\bf y}_i) = \frac12 \| {\bf f}(\theta; {\bf x}_i) - {\bf y}_i \|^2\]

To calculate the gradient of the loss function, we need the partial derivative of $c$ with respect to the output of the model $f$, which in this case is

\[\frac{\partial c(\theta; {\bf x}_i, {\bf y}_i)}{\partial {\bf f}(\theta; {\bf x}_i)} = {\bf f}(\theta; {\bf x}_i) - {\bf y}_i\]
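As a sketch, the averaged loss and this gradient can be written in a few lines of NumPy, assuming the model predictions (here called `f`, of shape `(N, d)`) have already been computed; the `1/N` factor comes from averaging over the dataset.

```python
import numpy as np

def l2_cost(f, y):
    """Least-square cost 1/2 ||f_i - y_i||^2, averaged over the batch."""
    return 0.5 * np.sum((f - y) ** 2, axis=-1).mean()

def l2_cost_grad(f, y):
    """Derivative of the averaged cost with respect to the predictions f."""
    return (f - y) / f.shape[0]

rng = np.random.default_rng(0)
f = rng.normal(size=(4, 3))   # predictions f(theta; x_i)
y = rng.normal(size=(4, 3))   # targets y_i
print(l2_cost(f, y), l2_cost_grad(f, y).shape)
```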

The main advantage of this loss function is its simplicity: linear regression with this loss can be solved analytically, and the least-square loss is twice differentiable everywhere. In the case of a model with Gaussian noise, it also connects with MLE. Indeed, if we assume that the true model is

\[{\bf y}_i = {\bf f}(\theta; {\bf x}_i) + \varepsilon\]

with $\varepsilon$ having a normal distribution with mean zero and variance $\sigma^2$, then the negative log-likelihood is given by

\[\frac{N}2 \log(2 \pi \sigma^2) + \frac1{2\sigma^2} \sum_i \| {\bf f}(\theta; {\bf x}_i) - {\bf y}_i \|^2\]

The first term does not depend on $\theta$, so minimizing the negative log-likelihood over $\theta$ is equivalent to minimizing the least-square loss.
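As a quick numerical check of this equivalence, the sketch below fits a hypothetical one-dimensional linear model \(f(\theta; x) = \theta x\) on made-up data, and compares the minimizer of the Gaussian negative log-likelihood with the closed-form least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.normal(size=50)
theta_true, sigma = 2.0, 0.3
y = theta_true * x + sigma * rng.normal(size=50)

def neg_log_likelihood(theta):
    # Gaussian negative log-likelihood with known sigma; the constant term is kept.
    r = theta * x - y
    return 0.5 * len(x) * np.log(2 * np.pi * sigma**2) + np.sum(r**2) / (2 * sigma**2)

theta_mle = minimize_scalar(neg_log_likelihood).x
theta_ls = np.sum(x * y) / np.sum(x**2)   # closed-form least-squares estimate
print(theta_mle, theta_ls)                # the two estimates agree
```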

A disadvantage of the least-square loss function is its sensitivity to outliers. Indeed, if the \(k^\text{th}\) data point is an outlier, its squared deviation will dominate the loss function, and during training the optimization will give too much importance to that data point in an attempt to minimize the loss. This is related to the fact that the least-square loss targets the mean of the distribution, which is sensitive to outliers.

This is in contrast to the $L_1$ loss function, which targets the median of the distribution, and is

\[c(\theta; {\bf x}_i, {\bf y}_i) = \| {\bf f}(\theta; {\bf x}_i) - {\bf y}_i \|_1\]

The main disadvantage of the $L_1$ loss function is numerical: it is non-differentiable at the origin and requires specific techniques (for example subgradient methods, or smoothed variants such as the Huber loss) to be handled.
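A minimal sketch of this mean-versus-median behaviour, with made-up data and a constant model \(f(\theta; x_i) = \theta\) (for which the $L_2$ minimizer is the sample mean and the $L_1$ minimizer is the sample median):

```python
import numpy as np

y = np.array([1.0, 1.2, 0.9, 1.1, 1.0])
y_outlier = np.append(y, 50.0)   # one corrupted observation

# The L2 minimizer (mean) jumps from ~1.04 to ~9.2 because of the outlier;
# the L1 minimizer (median) barely moves.
print("L2 minimizer:", y.mean(), "->", y_outlier.mean())
print("L1 minimizer:", np.median(y), "->", np.median(y_outlier))
```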

Binary classifier

Typically, loss functions for binary classification derive from MLE. Assuming the binary classes are \(y_i \in \{0,1\}\), and assuming that the classifier returns a value \(f({\bf x}_i) \in [0,1]\) which can be interpreted as \(\mathcal{P}[y_i = 1 \mid {\bf x}_i, \theta]\), the likelihood function is given by

\[\begin{align*} & \prod_i (f({\bf x}_i))^{\mathcal{1}_{\{y_i=1\}}} (1-f({\bf x}_i))^{\mathcal{1}_{\{y_i=0\}}} \\ = & \prod_i (f({\bf x}_i))^{y_i} (1-f({\bf x}_i))^{1-y_i} \end{align*}\]

So either $y_i=1$ and only the first term remains, or $y_i=0$ and only the second term remains. This is the general way of writing the likelihood function of a Bernoulli distribution that takes value $1$ with probability $f({\bf x}_i)$ and value $0$ with probability $1-f({\bf x}_i)$. The negative log-likelihood is then

\[- \sum_i \left[ y_i \log(f({\bf x}_i)) + (1-y_i) \log(1-f({\bf x}_i)) \right]\]
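A minimal NumPy sketch of this binary cross-entropy, assuming the predicted probabilities `p = f(x_i)` have already been computed and averaging over the batch as in the general loss above; the `eps` clipping is only there to avoid `log(0)`.

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-12):
    """Negative log-likelihood of Bernoulli labels y given predicted probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.4])
print(binary_cross_entropy(p, y))
```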

Let’s look at an example. In the case of a logistic regression, the model is given by

\[f({\bf x}_i) = \frac1{1 + e^{-{\bf x}_i^T \theta}}\]

In that case, the negative log-likelihood of the \(i^\text{th}\) data point is

\[\begin{align*} c(\theta; {\bf x}_i, y_i) & = y_i \log (1 + e^{-{\bf x}_i^T \theta}) - (1-y_i) \left( -{\bf x}_i^T \theta - \log(1 + e^{-{\bf x}_i^T \theta}) \right) \\ & = - y_i {\bf x}_i^T \theta + \log(1 + e^{{\bf x}_i^T \theta}) \end{align*}\]
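As a sketch, this simplified form can be evaluated in a numerically stable way with `np.logaddexp` and checked against the naive sigmoid-based expression (single made-up sample):

```python
import numpy as np

def logistic_nll(theta, x, y):
    """Per-sample logistic loss -y x^T theta + log(1 + exp(x^T theta)), computed stably."""
    z = x @ theta
    return -y * z + np.logaddexp(0.0, z)   # log(1 + e^z) without overflow

def logistic_nll_naive(theta, x, y, eps=1e-12):
    p = 1.0 / (1.0 + np.exp(-(x @ theta)))   # sigmoid output f(x_i)
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

theta = np.array([0.5, -1.0])
x, y = np.array([1.0, 2.0]), 1
print(logistic_nll(theta, x, y), logistic_nll_naive(theta, x, y))   # agree up to eps
```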

In the case where we have multiple classes \(y_i \in \{0,1,\dots,K-1\}\) and multiple outputs \({\bf f}({\bf x}_i) \in [0,1]^K\), and all outputs are mutually exclusive (i.e., they always sum to $1$, as is the case when using a softmax in a neural network), we have the interpretation that \({\bf f}({\bf x}_i)_j = \mathcal{P}(y_i = j \mid {\bf x}_i, \theta)\). The loss function is typically defined as

\[c(\theta; {\bf x}_i, y_i) = - \sum_{j=0}^{K-1} \mathcal{1}_{\{y_i = j\}} \log ( {\bf f}({\bf x}_i)_j ) = - \log( {\bf f}({\bf x}_i)_{y_i} )\]

In that case, the partial derivative of $c$ with respect to the output of the model is

\[\frac{\partial c(\theta; {\bf x}_i, y_i)}{\partial {\bf f}(\theta; {\bf x}_i)} = [\dots,0,- 1/{\bf f}({\bf x}_i)_{y_i},0,\dots]^T\]
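A minimal NumPy sketch of the multiclass cross-entropy and this gradient, assuming the softmax probabilities `probs` (shape `(N, K)`) are already available; the `1/N` factor again comes from averaging over the batch.

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Average cross-entropy -log probs[i, y_i] over the batch."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

def cross_entropy_grad(probs, labels):
    """Gradient w.r.t. the probabilities: -1/probs[i, y_i] at position y_i, 0 elsewhere."""
    grad = np.zeros_like(probs)
    idx = np.arange(len(labels))
    grad[idx, labels] = -1.0 / probs[idx, labels]
    return grad / len(labels)

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
print(cross_entropy(probs, labels))
print(cross_entropy_grad(probs, labels))
```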

It is interesting to note that these loss functions are often called cross-entropy losses, as the negative log-likelihood coincides with the cross-entropy between the empirical (one-hot) label distribution and the distribution predicted by the model.
