21 Oct 2020
I started looking into the Transformer model. It was first introduced in the
paper Attention is all you
need. I don’t
have time to dive into the details, but one key part of the paper is the
generalization of the concept of attention. In the
paper that introduced attention,
attention is applied to an encoder-decoder model, and the attention output is
calculated as a linear combination of all the hidden states of the encoder. The
weights of that linear combination are calculated by another neural network
(feedforward, I believe) that takes as input the current hidden state of the
decoder and all the hidden states of the encoder. The output is then passed
through a softmax, and the resulting scores multiply each hidden state of the
encoder, which we call the values. So the attention output is
\[ Attention = \sum_i Scores_i Values_i \]
where $Scores \in \mathbb{R}^n, Values \in \mathbb{R}^{n \times d_v}$, and $d_v$
is the dimension of the hidden states (the number of hidden cells).
The problem with that approach is that you calculate each attention score one
pair at a time, so if you have $m$ words in your input sequence and $n$ words
in your output sequence, you’ll pass through that network $m \times n$ times,
which can get slow pretty quickly.
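As a rough sketch of that original mechanism (the shapes, the parameter names,
and the exact form of the scoring network below are my own simplifications for
illustration, not the exact architecture from that paper):

import numpy as np

def additive_attention(s, H, W_s, W_h, v):
    # s: (d_s,) current decoder hidden state
    # H: (n, d_h) all encoder hidden states (the values)
    # W_s, W_h, v: parameters of the small feedforward scoring network
    e = np.tanh(s @ W_s + H @ W_h) @ v       # one raw score per encoder state, shape (n,)
    e = e - e.max()                          # for numerical stability
    scores = np.exp(e) / np.exp(e).sum()     # softmax -> attention weights
    return scores @ H                        # weighted sum of the values, shape (d_h,)

# toy dimensions and random "learned" parameters, just to run it
rng = np.random.default_rng(0)
d_s, d_h, d_a, n = 4, 6, 8, 5
s, H = rng.normal(size=d_s), rng.normal(size=(n, d_h))
W_s, W_h, v = rng.normal(size=(d_s, d_a)), rng.normal(size=(d_h, d_a)), rng.normal(size=d_a)
print(additive_attention(s, H, W_s, W_h, v).shape)  # (6,)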
So in the Transformer paper, the authors replace the scores generated by a
neural network with a dot product. To do so, they introduce the new concepts of
queries and keys. One analogy I heard to describe it is that it works like a
fuzzy lookup in a dict: if your query matches a key exactly, no problem, but if
not, this fuzzy dict will return a mix of the values whose keys most resemble
the query. That is the idea of the attention mechanism.
So with symbols, the attention mechanism becomes
\[ Attention = \text{softmax}(\frac{Q K^T}{\sqrt{d_k}}) V \]
where $Q \in \mathbb{R}^{m \times d_k}, K \in \mathbb{R}^{n \times d_k}, V \in
\mathbb{R}^{n \times d_v}$.
In the initial attention paper, the keys and the values are the same (the
encoder hidden states), the query is the hidden state of the decoder, and the
dot product $Q K^T$ is replaced with a neural network that takes $Q, K$ as
inputs.
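A minimal numpy sketch of that scaled dot-product attention formula, with no
masking and no multiple heads (the dimensions below are made up):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (m, d_k) queries, K: (n, d_k) keys, V: (n, d_v) values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (m, n)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax over the keys
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                    # (m, d_v)

rng = np.random.default_rng(0)
m, n, d_k, d_v = 3, 5, 4, 6
Q, K, V = rng.normal(size=(m, d_k)), rng.normal(size=(n, d_k)), rng.normal(size=(n, d_v))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 6)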
06 Oct 2020
I somehow always forget these things. So I’ll mark it down here once and for
all.
When looking for parameter $\beta$ in a linear system
\[ y = X . \beta + e \]
where $y \in \mathbb{R}^{n \times 1}, X \in \mathbb{R}^{n \times m}, \beta \in
\mathbb{R}^{m \times 1}$, and noise $e \sim \mathcal{N}(0, \sigma^2)$, it is
formulated as the following linear least squares problem
\[ \hat{\beta} = \arg \min_\beta || X . \beta - y ||_2^2 \]
When $X$ is full rank
When $X$ is full rank, we can just solve the normal equations, which come from
the first-order optimality condition of the least-squares problem,
\[ X^T . (X . \hat{\beta} - y) = 0 \]
which gives the solution
\[ \hat{\beta} = (X^T.X)^{-1} . X^T.y \]
This requires $X$ to be full rank so that $X^T.X$ is invertible.
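A quick numpy check on a toy problem I made up (note that we solve the normal
equations rather than explicitly inverting $X^T.X$, which is better
numerically):

import numpy as np

# toy full-rank problem: n = 50 samples, m = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=50)

# normal equations: solve (X^T X) beta = X^T y instead of forming the inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to beta_true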
When $X$ is rank deficient
If $X$ doesn’t have full rank, then we don’t have a unique answer $\hat{\beta}$.
Instead we have an entire linear space of solutions. That is, given any solution
$\hat{\beta}$ and any non-trivial vector $x$ in the null space of $X$,
$\hat{\beta} + x$ is still a solution; such a non-trivial $x$ exists precisely
because $X$ is rank deficient.
All of this matters if you care about $\hat{\beta}$, as it doesn’t have a unique
value anymore.
But if all you care about is transforming a given $X$ into a $y$, then that’s
no problem. All you need is to find a (any) solution $\hat{\beta}$. To do this,
we typically rely on the pseudo-inverse.
Let’s see why. When $X$ is rank-deficient, we can still write its singular value
decomposition, $X = U.\Sigma.V^\star$, where $U \in \mathbb{R}^{n \times n},
\Sigma \in \mathbb{R}^{n \times m}, V\in\mathbb{R}^{m \times m}$.
$U, V$ are unitary matrices.
$\Sigma$
is a diagonal matrix that contains the singular values (typically written in
decreasing order). Since $X$ is rank-deficient, some of these singular values
are zero. Let’s assume $rank(X) = r$. Then the reduced SVD is obtained by
keeping the first $r$ columns of $U$ and $V$, and keeping only the first $r$
singular values of $\Sigma$. We write this as $X = U_r . \Sigma_r . V_r^\star$,
where $U_r \in \mathbb{R}^{n\times r}, V_r \in \mathbb{R}^{m \times r}$ and
$\Sigma_r \in \mathbb{R}^{r \times r}$. Plugging the SVD into the
least-squares formulation, we can left-multiply inside the norm by the unitary
matrix $U^\star$, which does not change the norm. Splitting the result into its
first $r$ rows and its last $n - r$ rows, we get
\[ \arg \min_\beta || \Sigma_r . V_r^\star . \beta - U_r^\star . y ||^2 +
|| - U_{n-r}^\star . y ||^2 \]
Since the second term does not depend on $\beta$, it has no impact on the
minimizer, so we can drop it. We can also left-multiply the first term by
$\Sigma_r^{-1}$, which exists since all singular values in $\Sigma_r$ are
non-zero; this does not change the minimizer either, because the first term can
be driven all the way to zero. We get
\[ \arg \min_\beta || V_r^\star . \beta - \Sigma_r^{-1} . U_r^\star . y ||^2 \]
And we’re done. Well, almost. Now we can see where the non-uniqueness lies. The
term $V_r^\star . \beta$ is uniquely defined by this equation, and the
least-squares objective will be minimal (actually zero) when
\[ V_r^\star . \beta = \Sigma_r^{-1} . U_r^\star . y \]
However, $\beta$ itself is not uniquely defined. Indeed, let’s assume we know a
solution $\hat{\beta}$ to the equation above. Then adding to it any vector of
the form $v = V_{m-r} . z$ still gives a solution, since $V_r^\star . V_{m-r} = 0$.
What we can do, though, is look for a special solution, and typically we look
for the solution with the minimum norm. First, we can easily verify that
\[ \hat{\beta} = V_r . \Sigma_r^{-1} . U_r^\star . y \]
is a solution to the minimization problem above, as it will lead to a zero norm.
Now this is the unique solution lying in the subspace spanned by the columns of
$V_r$. Any other solution, of the form $\hat{\beta} + V_{m-r} . z$, will have a
larger norm, since the columns of $V_r$ and $V_{m-r}$ are orthogonal to each
other.
So we found the minimum-norm solution. It is often written in terms of the
pseudo-inverse of $X$, i.e., $\hat{\beta} = X^+ . y$, where $X^+ = V . \Sigma^+
. U^\star$ and $\Sigma^+ \in \mathbb{R}^{m \times n}$ is the diagonal matrix
obtained by inverting all non-zero singular values and leaving the zero ones at
zero. We can easily verify that the reduced form of the pseudo-inverse is indeed
$V_r . \Sigma_r^{-1} . U_r^\star$.
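A quick numpy check of all of this, on a deliberately rank-deficient $X$ I made
up (one column is duplicated), comparing the reduced-SVD construction with
np.linalg.pinv and np.linalg.lstsq:

import numpy as np

# rank-deficient X: the third column is a copy of the first
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 2))
X = np.column_stack([A[:, 0], A[:, 1], A[:, 0]])   # shape (20, 3), rank 2
y = rng.normal(size=20)

# minimum-norm solution built from the reduced SVD
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = int(np.sum(s > 1e-10))                         # numerical rank
Ur, sr, Vr = U[:, :r], s[:r], Vt[:r, :].T
beta_svd = Vr @ ((Ur.T @ y) / sr)

# same thing via the pseudo-inverse and via lstsq
beta_pinv = np.linalg.pinv(X) @ y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_svd, beta_pinv), np.allclose(beta_svd, beta_lstsq))  # True True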
05 Oct 2020
Rob Hyndman has a blog post
where he details the difference between confidence interval, prediction
interval, and credible interval.
I think he makes a good point that confidence intervals and prediction
intervals are often used interchangeably, even though they are quite different.
A confidence interval comes from the realm of frequentist inference, and applies
to a parameter that has been estimated using a statistical method. A 95%
confidence interval is an interval that would contain the true value of that
parameter 95% of the time, if we could repeat the experiment indefinitely.
A prediction interval is an interval for a predicted value. Not for a model
parameter. That predicted value does not exist yet (that’s why we need to
predict it), and its uncertainty is represented by a random variable. A 70%
prediction interval will contain 70% of the mass of that random variable; or to
be more general, it would be the 70% HDI (highest density interval).
A credible interval is kind of a mix between the two. It’s the Bayesian
equivalent of a confidence interval, and therefore applies to the posterior
distribution of a model parameter. A 75% credible interval is the 75% HDI of
the posterior distribution.
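A toy illustration (my own example, not from the post): for a plain normal
sample, the confidence interval for the mean vs. the prediction interval for a
single new observation:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=30)
n, xbar, s = len(x), x.mean(), x.std(ddof=1)
t = stats.t.ppf(0.975, df=n - 1)

# 95% confidence interval for the mean (a model parameter)
ci = (xbar - t * s / np.sqrt(n), xbar + t * s / np.sqrt(n))
# 95% prediction interval for a single new observation (much wider)
pi = (xbar - t * s * np.sqrt(1 + 1 / n), xbar + t * s * np.sqrt(1 + 1 / n))
print(ci, pi)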
11 Aug 2020
You can pass local environment variables to your ssh session, and even use those
to define environment variables inside your ssh session. For that, you pass a
command to run on the server, and you need the -t flag (placed before the
server address) so that you still get an interactive shell. An example:
ssh -p 2222 -A -t myname@myserver "SERVER_ENV_VAR=$LOCAL_ENV_VAR; export SERVER_ENV_VAR; bash -l"
After that, you should see an environment variable called SERVER_ENV_VAR
inside your ssh session.
04 Aug 2020
In order to render rst files in pycharm, you need to set up some environment
variables, otherwise you’ll get the error NameError: Cannot find docutils in selected interpreter.
In short, you need to add
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
to your .bash_profile. You can then restart pycharm and it should work.
Ref: Stackoverflow