29 Jul 2019
The Box-Cox transformation is a parametric transformation that includes the
logarithmic transformation as a special case, and is defined as
\[w_t = \left\{
\begin{aligned}
& \log(y_t) , \text{ if } \lambda = 0 \\
& \frac{(y_t)^\lambda - 1}{\lambda} , \text{ otherwise}
\end{aligned} \right.\]
Quick note: the inclusion of the log-transformation is justified by the fact
that $(y_t)^\lambda \approx 1 + \lambda \log(y_t)$ as $\lambda \rightarrow 0$,
so that $((y_t)^\lambda - 1)/\lambda \rightarrow \log(y_t)$.
Assuming you get a better-behaved distribution with $w_t$, you can run your
inference on that variable. But once this is done, you still need to map it back
to the quantity of interest, that is, $y_t$. The inverse Box-Cox transformation is given
by
\[y_t = \left\{
\begin{aligned}
& \exp(w_t) , \text{ if } \lambda = 0 \\
& \left(\lambda w_t + 1 \right)^{1/\lambda} , \text{ otherwise}
\end{aligned} \right.\]
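As a quick sketch, the pair of transformations can be written directly (scipy.stats also provides a `boxcox` implementation):

```python
import numpy as np

def boxcox(y, lam):
    """Box-Cox transform; reduces to log(y) when lam = 0."""
    if lam == 0:
        return np.log(y)
    return (y**lam - 1) / lam

def inv_boxcox(w, lam):
    """Inverse Box-Cox transform."""
    if lam == 0:
        return np.exp(w)
    return (lam * w + 1) ** (1 / lam)

y = np.array([1.0, 2.0, 5.0, 10.0])
w = boxcox(y, 0.5)
assert np.allclose(inv_boxcox(w, 0.5), y)  # round-trip recovers y
```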
However, one needs to be careful about the distribution of the inverted
prediction~$y_t$.
A general (non-parametric) way of handling this, and the way chosen by
Hyndman, is to do a
Taylor expansion around the mean. That is, calling $f$ the inverted Box-Cox
transformation, and calling $\mu$ and $\sigma^2$ the mean and variance of the
transformed variable $w_t$, we would have
\(y_t = f(w_t) \approx f(\mu) + (w_t - \mu) f'(\mu) + \frac12 (w_t - \mu)^2
f''(\mu).\) Then taking the mean of that expression, we get
\(\mathbb{E}[y_t] \approx f(\mu) + \frac12 \sigma^2 f''(\mu).\)
However, in the special case of the log-transformation, and if the
transformed variable $w_t$ is assumed to be normal (a large class of models will
make that assumption when calculating the uncertainty around the mean), we can
simply use the properties of a
log-Normal, which tell
us that (1) the naive back-transform $\exp(\mu)$ is the median of $y_t$, and (2) the mean of $y_t$ is
given by $\exp(\mu + \sigma^2/2)$. In the case $\sigma^2 \ll 1$, we recover the
expression derived from the Taylor expansion.
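To make the $\lambda = 0$ case concrete, here is a small numerical check (the parameter values are arbitrary) comparing the naive back-transform, the Taylor correction, and the exact log-normal mean:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.3                       # mean and std of the transformed variable w_t

w = rng.normal(mu, sigma, size=1_000_000)  # w_t assumed normal
y = np.exp(w)                              # inverse of the log-transform (lambda = 0)

naive  = np.exp(mu)                        # inverting the mean directly: the *median* of y_t
taylor = np.exp(mu) * (1 + sigma**2 / 2)   # f(mu) + (sigma^2/2) f''(mu), with f = exp
exact  = np.exp(mu + sigma**2 / 2)         # exact log-normal mean

print(naive, taylor, exact, y.mean())      # taylor and exact nearly coincide here
```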
10 May 2019
It is possible to perform full-on Bayesian inference in Deep Learning. One could
train a network, assume some sort of Gaussian distribution around the weights
that were found, and the likelihood comes directly from the loss function
since, in general, that loss function is derived from MLE. However, this is
costly, and in practice it’s not clear whether people do that or not.
There have been some attempts at using variational inference, i.e., approximating
the posterior with a simpler distribution (e.g., Gaussian), by looking for the
candidate distribution that minimizes the Kullback-Leibler divergence between
the posterior and the candidate distribution. It also seems to be
computationally expensive.
The most popular option is to use Monte-Carlo Dropout (MC Dropout) at inference
time (see here and
here).
The idea is to generate samples of the NN's output by randomly shutting
down a certain number of units on each forward pass. The authors prove
some interesting properties of their method.
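As a rough numpy sketch of that sampling mechanism (the architecture and weights below are made up purely for illustration), dropout is simply kept active at inference time:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pre-trained one-hidden-layer network (weights are arbitrary here).
W1 = rng.normal(size=(1, 50))
W2 = rng.normal(size=(50, 1)) / 50

def forward(x, p_drop=0.5):
    """One stochastic forward pass: dropout stays ON at inference."""
    h = np.maximum(0.0, x @ W1)          # ReLU hidden layer
    mask = rng.random(h.shape) > p_drop  # randomly shut down units
    h = h * mask / (1 - p_drop)          # inverted-dropout scaling
    return h @ W2

x = np.array([[0.7]])
samples = np.array([forward(x)[0, 0] for _ in range(500)])
mean, std = samples.mean(), samples.std()  # predictive mean and uncertainty
```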
More recently, Stochastic Weight Averaging (SWA) was introduced (see
paper and blog
post).
SWA is not, strictly speaking, a new optimization algorithm; it simply modifies the
typical learning rate schedule (a high learning rate, followed by a continuously
decreasing rate, down to a very slow one) to always maintain a
relatively large learning rate and keep exploring the loss surface. And
after the training procedure, the latest weights, i.e., those corresponding to the
exploration phase, are averaged out. That average is designed to position the
SWA weights in the middle of a large flat region (the authors assume that NN
minima lie in large flat portions of the loss surface), instead of close to the
edge (as with typical optimizers, e.g., SGD), leading to better generalization properties.
You have the option, in the exploration phase, to only include a regular
subset of the weight snapshots.
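A minimal numpy sketch of the averaging rule, on a toy noisy quadratic loss (the loss, learning rate, and cycle length are arbitrary choices, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic loss 0.5 * ||w - w_star||^2; its gradient is (w - w_star).
w_star = np.array([1.0, -2.0])
w = np.zeros(2)

lr, cycle = 0.5, 10                    # constant, relatively large learning rate
w_swa, n_snapshots = np.zeros(2), 0

for step in range(1, 501):
    grad = (w - w_star) + 0.3 * rng.normal(size=2)  # noisy (SGD-like) gradient
    w -= lr * grad                                  # keep exploring the loss surface
    if step % cycle == 0:                           # snapshot every `cycle` steps
        w_swa = (w_swa * n_snapshots + w) / (n_snapshots + 1)
        n_snapshots += 1
```

The individual iterates keep bouncing around the minimum, while the running average `w_swa` settles near the center of the region being explored.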
There is a potential issue when SWA is used in conjunction with batch
normalization: batch normalization learns the statistics of the
activations during training, but the SWA weights were never used during that
phase. The solution proposed by the authors is to pass the training set
through the network again after training (i.e., with the SWA weights), and re-calculate
the BN statistics at that point.
The authors also have an extension called SWA-Gaussian (SWAG) that allows one to
carry out uncertainty quantification by estimating the first two moments of the weight
distribution, again based on snapshots from the exploration phase.
01 Mar 2019
To install it
As is often the case, the easiest way to install it was to use anaconda. It is even
documented in the
pyomo
documentation.
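For reference, the conda-forge route documented there is along these lines:

```shell
conda install -c conda-forge pyomo
```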
01 Mar 2019
The main challenge of tracking notebooks with git is that every time you
re-generate an output, git will track that change. It makes each commit large,
and hard to follow.
Fortunately, there is a script that removes the output prior to your commit. I’m
using nbstripout. You can turn the git filter on inside your repo, and turn it
off again when needed.
It works great. Unfortunately, it doesn’t behave super well when switching to
other branches where nbstripout was not applied; basically, it removes the
output from every notebook it finds, and then requires you to git commit
before you can switch out of that branch. Not great. So instead, I apply it
manually to the notebook that I want to commit.
04 Feb 2019
I just want to summarize a few results that I find useful when dealing with
arithmetic and geometric Brownian motions.
For simplicity, I looked at unitary time steps. If that’s not the case (e.g., a daily
rate of change observed hourly), then some minor adjustments need to be made.
Arithmetic Brownian motion
The SDE for the ABM is given by
\(dI_t = \mu I_0 \, dt + \sigma I_0 \, dW_t\)
The solution is then
\(I_d = I_0 (1 + \mu d + \sigma W_d)\)
where $W_d$ is a Brownian motion. This means the following moments for the
quantity $I_d$,
\[\begin{aligned}
\mathbb{E}[I_d] & = I_0 (1 + \mu d) \\
\text{Var}[I_d] & = \sigma^2 I_0^2 d \\
\text{Cov}[I_d, I_{d'}] & = \sigma^2 I_0^2 \min(d,d') \\
\end{aligned}\]
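These moments are easy to check by simulation (the parameter values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
I0, mu, sigma, d = 100.0, 0.05, 0.2, 4

# W_d ~ N(0, d); the ABM solution is I_d = I_0 (1 + mu d + sigma W_d)
W_d = rng.normal(0.0, np.sqrt(d), size=1_000_000)
I_d = I0 * (1 + mu * d + sigma * W_d)

print(I_d.mean(), I0 * (1 + mu * d))    # empirical vs theoretical mean
print(I_d.var(), sigma**2 * I0**2 * d)  # empirical vs theoretical variance
```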
Geometric Brownian motion
You can find some info from Wikipedia,
here
and
here.
The SDE for GBM is
\(dI_t = \mu I_t \, dt + \sigma I_t \, dW_t\).
Then the solution is
\(I_d = I_0 \exp \left( (\mu - \sigma^2/2)d + \sigma W_d \right)\).
This means the following moments for the
quantity $I_d$,
\[\begin{aligned}
\mathbb{E}[I_d] & = I_0 e^{\mu d} \\
\text{Var}[I_d] & = I_0^2 e^{2\mu d}(e^{\sigma^2 d} - 1) \\
\text{Cov}[I_d, I_{d'}] & = I_0^2 e^{\mu(d+d')} \left( e^{\sigma^2 \min(d,d')}
-1 \right)
\end{aligned}\]
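Again, a quick simulation check of those moments (same arbitrary parameter values as above):

```python
import numpy as np

rng = np.random.default_rng(0)
I0, mu, sigma, d = 100.0, 0.05, 0.2, 4

# W_d ~ N(0, d); the GBM solution is I_d = I_0 exp((mu - sigma^2/2) d + sigma W_d)
W_d = rng.normal(0.0, np.sqrt(d), size=1_000_000)
I_d = I0 * np.exp((mu - sigma**2 / 2) * d + sigma * W_d)

mean_th = I0 * np.exp(mu * d)
var_th = I0**2 * np.exp(2 * mu * d) * (np.exp(sigma**2 * d) - 1)
print(I_d.mean(), mean_th)  # empirical vs theoretical mean
print(I_d.var(), var_th)    # empirical vs theoretical variance
```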
For the covariance, you can derive it as
\[\begin{aligned}
\text{Cov}[I_d, I_{d'}] & = \mathbb{E}[I_d I_{d'}] - \mathbb{E}[I_d]
\mathbb{E}[I_{d'}] \\
& = I_0^2 e^{(\mu-\sigma^2/2)(d+d')} \mathbb{E}[e^{\sigma(W_d + W_{d'})}] -
I_0^2 e^{\mu(d+d')}
\end{aligned}\]
And the last expectation is $\mathbb{E}[e^Y]$ where
$Y = \sigma(W_d + W_{d'}) \sim \mathcal{N}(0, \sigma^2(d+d'+2\min(d,d')))$, so that $e^Y$ is
log-normal,
which gives
$\mathbb{E}[e^Y] = \exp\left[\frac{\sigma^2}{2}\left(d+d'+2\min(d,d')\right)\right]$, and the result follows.
It is interesting to express the parameters $\mu, \sigma$ in terms of the empirical
moments of the quantity $I_d$. This gives
\[\begin{aligned}
\mu & = \frac1d \log \frac{\mathbb{E}[I_d]}{I_0} \\
\sigma^2 & = \frac1d \log \left( 1 + \frac{\text{Var}[I_d]}{I_0^2 e^{2 \mu d}} \right)
\end{aligned}\]
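A small sanity check that this inversion is consistent with the moment formulas above (parameter values arbitrary):

```python
import numpy as np

I0, d = 100.0, 4
mu_true, sigma_true = 0.05, 0.2

# Theoretical moments of I_d under the GBM solution
E = I0 * np.exp(mu_true * d)
V = I0**2 * np.exp(2 * mu_true * d) * (np.exp(sigma_true**2 * d) - 1)

# Recover (mu, sigma) from the moments
mu = np.log(E / I0) / d
sigma2 = np.log(1 + V / (I0**2 * np.exp(2 * mu * d))) / d

assert np.isclose(mu, mu_true) and np.isclose(np.sqrt(sigma2), sigma_true)
```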