21 Oct 2020
I started looking into the Transformer model. It was first introduced in the
paper Attention is all you
need. I don’t
have time to dive into the details, but one key part of the paper is the
generalization of the concept of attention. In the
paper that introduced attention,
attention is applied to an encoder-decoder model, and the attention output is
calculated as a linear combination of all the hidden states of the encoder. The
weights of that linear combination are calculated by another neural network
(feedforward, I believe) that takes as input the current hidden state of the
decoder and all the hidden states of the encoder. The output is then passed
through a softmax, and the resulting scores multiply each hidden state of the
encoder, which we call the values. So the attention output is
\[ Attention = \sum_i Scores_i Values_i \]
where $Scores \in \mathbb{R}^n, Values \in \mathbb{R}^{n \times d_v}$, and $d_v$
is the dimension of the hidden states (the number of hidden cells).
The problem with that approach is that you calculate each attention score one
pair at a time, so if you have $m$ words in your input sequence and $n$ words
in your output sequence, you’ll pass through that network $m \times n$ times,
which can get slow pretty quickly.
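As a rough sketch of that original mechanism (the shapes, the parameter names,
and the exact form of the scoring network below are my own simplifications for
illustration, not the exact architecture from that paper):

import numpy as np

def additive_attention(s, H, W_s, W_h, v):
    # s: (d_s,) current decoder hidden state
    # H: (n, d_h) all encoder hidden states (the values)
    # W_s, W_h, v: parameters of the small feedforward scoring network
    e = np.tanh(s @ W_s + H @ W_h) @ v       # one raw score per encoder state, shape (n,)
    e = e - e.max()                          # for numerical stability
    scores = np.exp(e) / np.exp(e).sum()     # softmax -> attention weights
    return scores @ H                        # weighted sum of the values, shape (d_h,)

# toy dimensions and random "learned" parameters, just to run it
rng = np.random.default_rng(0)
d_s, d_h, d_a, n = 4, 6, 8, 5
s, H = rng.normal(size=d_s), rng.normal(size=(n, d_h))
W_s, W_h, v = rng.normal(size=(d_s, d_a)), rng.normal(size=(d_h, d_a)), rng.normal(size=d_a)
print(additive_attention(s, H, W_s, W_h, v).shape)  # (6,)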
So in the Transformer paper, the authors replace the scores generated by a
neural network with a dot product. To do so, they introduce the new concepts of
queries and keys. One analogy I heard to describe it is that it works like a
fuzzy lookup in a dict: if your query matches a key exactly, no problem, but if
not, this fuzzy dict will return a mix of the values whose keys most resemble
the query. That is the idea of the attention mechanism.
So with symbols, the attention mechanism becomes
\[ Attention = \text{softmax}(\frac{Q K^T}{\sqrt{d_k}}) V \]
where $Q \in \mathbb{R}^{m \times d_k}, K \in \mathbb{R}^{n \times d_k}, V \in
\mathbb{R}^{n \times d_v}$.
In the initial attention paper, the keys and the values are the same (the
encoder hidden states), the query is the hidden state of the decoder, and the
dot product $Q K^T$ is replaced with a neural network that takes $Q, K$ as
inputs.
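A minimal numpy sketch of that scaled dot-product attention formula, with no
masking and no multiple heads (the dimensions below are made up):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (m, d_k) queries, K: (n, d_k) keys, V: (n, d_v) values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (m, n)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax over the keys
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                    # (m, d_v)

rng = np.random.default_rng(0)
m, n, d_k, d_v = 3, 5, 4, 6
Q, K, V = rng.normal(size=(m, d_k)), rng.normal(size=(n, d_k)), rng.normal(size=(n, d_v))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 6)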
06 Oct 2020
I somehow always forget these things. So I’ll mark it down here once and for
all.
When looking for parameter $\beta$ in a linear system
\[ y = X . \beta + e \]
where $y \in \mathbb{R}^{n \times 1}, X \in \mathbb{R}^{n \times m}, \beta \in
\mathbb{R}^{m \times 1}$, and noise $e \sim \mathcal{N}(0, \sigma^2)$, it is
formulated as the following linear least squares problem
\[ \hat{\beta} = \arg \min_\beta || X . \beta - y ||_2^2 \]
When $X$ is full rank
When $X$ is full rank, we can just solve the normal equations, which come from
the first-order optimality condition of the least-squares problem,
\[ X^T . (X . \hat{\beta} - y) = 0 \]
which gives the solution
\[ \hat{\beta} = (X^T.X)^{-1} . X^T.y \]
This requires $X$ to be full rank so that $X^T.X$ is invertible.
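A quick numpy check on a toy problem I made up (note that we solve the normal
equations rather than explicitly inverting $X^T.X$, which is better
numerically):

import numpy as np

# toy full-rank problem: n = 50 samples, m = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=50)

# normal equations: solve (X^T X) beta = X^T y instead of forming the inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to beta_true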
When $X$ is rank deficient
If $X$ doesn’t have full rank, then we don’t have a unique answer $\hat{\beta}$.
Instead we have an entire linear space of solutions. That is, given any solution
$\hat{\beta}$ and any non-trivial vector $x$ in the null space of $X$,
$\hat{\beta} + x$ is still a solution; such a non-trivial $x$ exists precisely
because $X$ is rank deficient.
All of this matters if you care about $\hat{\beta}$, as it doesn’t have a unique
value anymore.
But if all you care about is transforming a given $X$ into a $y$, then that’s
no problem. All you need is to find a (any) solution $\hat{\beta}$. To do this,
we typically rely on the pseudo-inverse.
Let’s see why. When $X$ is rank-deficient, we can still write its singular value
decomposition, $X = U.\Sigma.V^\star$, where $U \in \mathbb{R}^{n \times n},
\Sigma \in \mathbb{R}^{n \times m}, V\in\mathbb{R}^{m \times m}$.
$U, V$ are unitary matrices.
$\Sigma$
is a diagonal matrix that contains the singular values (typically written in
decreasing order). Since $X$ is rank-deficient, some of these singular values
are zero. Let’s assume $rank(X) = r$. Then the reduced SVD is obtained by
keeping the first $r$ columns of $U$ and $V$, and keeping only the first $r$
singular values of $\Sigma$. We write this as $X = U_r . \Sigma_r . V_r^\star$,
where $U_r \in \mathbb{R}^{n\times r}, V_r \in \mathbb{R}^{m \times r}$ and
$\Sigma_r \in \mathbb{R}^{r \times r}$. Plugging the SVD into the
least-squares formulation, we can left-multiply inside the norm by the unitary
matrix $U^\star$, which does not change the norm. Splitting the result into its
first $r$ rows and its last $n - r$ rows, we get
\[ \arg \min_\beta || \Sigma_r . V_r^\star . \beta - U_r^\star . y ||^2 +
|| - U_{n-r}^\star . y ||^2 \]
Since the second term does not depend on $\beta$, it has no impact on the
minimizer, so we can drop it. We can also left-multiply the first term by
$\Sigma_r^{-1}$, which exists since all singular values in $\Sigma_r$ are
non-zero; this does not change the minimizer either, because the first term can
be driven all the way to zero. We get
\[ \arg \min_\beta || V_r^\star . \beta - \Sigma_r^{-1} . U_r^\star . y ||^2 \]
And we’re done. Well, almost. Now we can see where the non-uniqueness lies. The
term $V_r^\star . \beta$ is uniquely defined by this equation, and the
least-squares objective will be minimal (actually zero) when
\[ V_r^\star . \beta = \Sigma_r^{-1} . U_r^\star . y \]
However, $\beta$ itself is not uniquely defined. Indeed, let’s assume we know a
solution $\hat{\beta}$ to the equation above. Then adding to it any vector of
the form $v = V_{m-r} . z$ still gives a solution, since $V_r^\star . V_{m-r} = 0$.
What we can do, though, is look for a special solution, and typically we look
for the solution with the minimum norm. First, we can easily verify that
\[ \hat{\beta} = V_r . \Sigma_r^{-1} . U_r^\star . y \]
is a solution to the minimization problem above, as it will lead to a zero norm.
Now this is the unique solution lying in the subspace spanned by the columns of
$V_r$. Any other solution, of the form $\hat{\beta} + V_{m-r} . z$, will have a
larger norm, since the columns of $V_r$ and $V_{m-r}$ are orthogonal to each
other.
So we found the minimum-norm solution. It is often written in terms of the
pseudo-inverse of $X$, i.e., $\hat{\beta} = X^+ . y$, where $X^+ = V . \Sigma^+
. U^\star$ and $\Sigma^+ \in \mathbb{R}^{m \times n}$ is the diagonal matrix
obtained by inverting all non-zero singular values and leaving the zero ones at
zero. We can easily verify that the reduced form of the pseudo-inverse is indeed
$V_r . \Sigma_r^{-1} . U_r^\star$.
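A quick numpy check of all of this, on a deliberately rank-deficient $X$ I made
up (one column is duplicated), comparing the reduced-SVD construction with
np.linalg.pinv and np.linalg.lstsq:

import numpy as np

# rank-deficient X: the third column is a copy of the first
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 2))
X = np.column_stack([A[:, 0], A[:, 1], A[:, 0]])   # shape (20, 3), rank 2
y = rng.normal(size=20)

# minimum-norm solution built from the reduced SVD
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = int(np.sum(s > 1e-10))                         # numerical rank
Ur, sr, Vr = U[:, :r], s[:r], Vt[:r, :].T
beta_svd = Vr @ ((Ur.T @ y) / sr)

# same thing via the pseudo-inverse and via lstsq
beta_pinv = np.linalg.pinv(X) @ y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_svd, beta_pinv), np.allclose(beta_svd, beta_lstsq))  # True True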
05 Oct 2020
Rob Hyndman has a blog post
where he details the difference between confidence interval, prediction
interval, and credible interval.
I think he makes a good point that confidence intervals and prediction
intervals are often used interchangeably, even though they are quite different.
A confidence interval comes from the realm of frequentist inference, and applies
to a parameter that has been estimated using a statistical method. A 95%
confidence interval is an interval that would contain the true value of that
parameter 95% of the time, if we could repeat the experiment indefinitely.
A prediction interval is an interval for a predicted value. Not for a model
parameter. That predicted value does not exist yet (that’s why we need to
predict it), and its uncertainty is represented by a random variable. A 70%
prediction interval will contain 70% of the mass of that random variable; or to
be more general, it would be the 70% HDI (highest density interval).
A credible interval is kind of a mix between the two. It’s the Bayesian
equivalent of a confidence interval, and therefore applies to the posterior
distribution of a model parameter. A 75% credible interval is the 75% HDI of
the posterior distribution.
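A toy illustration (my own example, not from the post): for a plain normal
sample, the confidence interval for the mean vs. the prediction interval for a
single new observation:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=30)
n, xbar, s = len(x), x.mean(), x.std(ddof=1)
t = stats.t.ppf(0.975, df=n - 1)

# 95% confidence interval for the mean (a model parameter)
ci = (xbar - t * s / np.sqrt(n), xbar + t * s / np.sqrt(n))
# 95% prediction interval for a single new observation (much wider)
pi = (xbar - t * s * np.sqrt(1 + 1 / n), xbar + t * s * np.sqrt(1 + 1 / n))
print(ci, pi)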
11 Aug 2020
You can pass local environment variables to your ssh session, and even use those
to define environment variables inside your ssh session. For that, you pass a
command to run on the server, and you need the -t flag (placed before the
server address) so that you still get an interactive shell. An example:
ssh -p 2222 -A -t myname@myserver "SERVER_ENV_VAR=$LOCAL_ENV_VAR; export SERVER_ENV_VAR; bash -l"
After that, you should see an environment variable called SERVER_ENV_VAR
inside your ssh session.
04 Aug 2020
In order to render rst files in pycharm, you need to set up some environment
variables, otherwise you’ll get the error NameError: Cannot find docutils in selected interpreter.
In short, you need to add
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
to your .bash_profile. You can then restart pycharm and it should work.
Ref: Stackoverflow