Bias adjustment in Box-Cox transformation

The Box-Cox transformation is a parametric transformation that includes the logarithmic transformation as a special case, and is defined as

\[w_t = \left\{ \begin{aligned} & \log(y_t) , \text{ if } \lambda = 0 \\ & \frac{(y_t)^\lambda - 1}{\lambda} , \text{ otherwise} \end{aligned} \right.\]

Quick note: the inclusion of the log-transformation is justified by the fact that $(y_t)^\lambda \approx 1 + \lambda \log(y_t)$ when $\lambda \rightarrow 0$, so that $\frac{(y_t)^\lambda - 1}{\lambda} \rightarrow \log(y_t)$.

Assuming $w_t$ has a better-behaved distribution, you can run your inference on that variable. But once this is done, you still need to revert back to the quantity of interest, that is $y_t$. The inverse Box-Cox transformation is given by

\[y_t = \left\{ \begin{aligned} & \exp(w_t) , \text{ if } \lambda = 0 \\ & \left(\lambda w_t + 1 \right)^{1/\lambda} , \text{ otherwise} \end{aligned} \right.\]
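For reference, here is a minimal numpy sketch of the forward and inverse transforms, written straight from the two formulas above (the function names are mine):

import numpy as np

def boxcox(y, lam):
    # forward Box-Cox transform
    if lam == 0:
        return np.log(y)
    return (y**lam - 1) / lam

def inv_boxcox(w, lam):
    # inverse Box-Cox transform
    if lam == 0:
        return np.exp(w)
    return (lam * w + 1) ** (1 / lam)

# round trip: inv_boxcox(boxcox(y, lam), lam) should recover y
y = np.array([1.0, 2.0, 5.0])
assert np.allclose(inv_boxcox(boxcox(y, 0.3), 0.3), y)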

However, one needs to be careful about the distribution of the back-transformed prediction $y_t$. A general (non-parametric) way of handling this, and the one chosen by Hyndman, is to do a Taylor expansion around the mean. That is, calling $f$ the inverse Box-Cox transformation, and calling $\mu$ and $\sigma^2$ the mean and variance of the transformed variable $w_t$, we have

\(y_t = f(w_t) \approx f(\mu) + (w_t - \mu) f'(\mu) + \frac12 (w_t - \mu)^2 f''(\mu).\) Taking the expectation of that expression (the linear term vanishes since $\mathbb{E}[w_t - \mu] = 0$, and $\mathbb{E}[(w_t - \mu)^2] = \sigma^2$), we get \(\mathbb{E}[y_t] \approx f(\mu) + \frac12 \sigma^2 f''(\mu).\)

However, in the special case of the log-transformation, and if the transformed variable $w_t$ is assumed to be normal (a large class of models will make that assumption when calculating the uncertainty around the mean), we can simply use the results for a log-normal distribution, which tell us that (1) $\exp(\mu)$ is the median of $y_t$, and (2) the mean of $y_t$ is given by $\exp(\mu + \sigma^2/2)$. In the case $\sigma^2 \ll 1$, we recover the expression derived from the Taylor expansion.
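As a concrete illustration, here is a minimal sketch of the bias-adjusted mean, assuming $w_t \sim \mathcal{N}(\mu, \sigma^2)$: the exact log-normal result for $\lambda = 0$, and the Taylor-based correction otherwise. For the inverse transform $f(w) = (\lambda w + 1)^{1/\lambda}$ we have $f''(w) = (1-\lambda)(\lambda w + 1)^{1/\lambda - 2}$. The function name is mine.

import numpy as np

def bias_adjusted_mean(mu, sigma2, lam):
    # E[y_t] after inverting the Box-Cox transform
    if lam == 0:
        # exact log-normal mean
        return np.exp(mu + sigma2 / 2)
    # second-order Taylor correction: f(mu) + 0.5 * sigma2 * f''(mu)
    f_mu = (lam * mu + 1) ** (1 / lam)
    f2_mu = (1 - lam) * (lam * mu + 1) ** (1 / lam - 2)
    return f_mu + 0.5 * sigma2 * f2_mu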

Bayesian inference in Deep Learning

It is possible to perform full-on Bayesian inference in Deep Learning. One could train a network, assume some sort of Gaussian distribution around the weights that were found, and obtain the likelihood directly from the loss function since, in general, that loss function is derived from maximum-likelihood estimation. However, this is costly, and in practice it's not clear how often it is actually done.

There have been some attempts at using variational inference, i.e., approximating the posterior with a simpler distribution (e.g., Gaussian) by looking for the candidate distribution that minimizes the Kullback-Leibler divergence between the posterior and the candidate. It also seems to be computationally expensive.

The most popular option is to use Monte-Carlo Dropout (MC Dropout) at inference time (see here and here). The idea is to generate samples of the network's output by randomly shutting down a certain number of units on each forward pass. The authors prove some interesting properties of their method.
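For illustration, here is a minimal PyTorch sketch of MC Dropout at inference time (the architecture and sample count are placeholders, not the authors' setup): dropout is kept active and many stochastic forward passes are aggregated.

import torch
import torch.nn as nn

# placeholder model with a dropout layer
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(p=0.2), nn.Linear(64, 1))

def mc_dropout_predict(model, x, n_samples=100):
    # keep dropout stochastic at inference time
    model.train()
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    # predictive mean and a simple spread estimate
    return samples.mean(dim=0), samples.std(dim=0)

x = torch.randn(5, 10)
mean, std = mc_dropout_predict(model, x)

Note that model.train() also switches batch-norm layers to training mode; in a model that has them, you would only set the Dropout modules to train mode.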

More recently, Stochastic Weight Averaging (SWA) was introduced (see paper and blog post). SWA is not, strictly speaking, a new optimization algorithm; it simply modifies the typical learning-rate strategy (high learning rate, followed by a continuously decreasing rate, down to a very small learning rate) to always maintain a relatively large learning rate and keep exploring the loss surface. After the training procedure, the latest weights, i.e., those corresponding to the exploration phase, are averaged. That average is designed to position the SWA weights in the middle of a large flat region (the authors assume that NN minima lie in large flat portions of the loss surface), instead of close to its edge (as typical optimizers, e.g., SGD, tend to do), leading to better generalization properties. You have the option, in the exploration phase, to only include a regular sub-sample of the weight snapshots in the average. There is a potential issue when SWA is used in conjunction with batch normalization: batch normalization learns the statistics of the activations during training, but the SWA weights were never used during that phase. The solution proposed by the authors is to pass the training set through the network again after training (i.e., with the SWA weights), and re-calculate the BN statistics at that time.
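For reference, here is a rough sketch of that procedure using PyTorch's torch.optim.swa_utils (the toy model, data, and schedule lengths are made up; the authors' original code may differ):

import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn
from torch.utils.data import DataLoader, TensorDataset

# toy model and data, just to make the sketch runnable
model = nn.Sequential(nn.Linear(10, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Linear(32, 1))
train_loader = DataLoader(TensorDataset(torch.randn(256, 10), torch.randn(256, 1)), batch_size=32)
loss_fn = nn.MSELoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
swa_model = AveragedModel(model)               # running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)  # relatively large, constant learning rate
swa_start = 75                                 # epoch where the averaging (exploration) phase starts

for epoch in range(100):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)     # snapshot of the current weights
        swa_scheduler.step()

# batch-norm statistics were learned with the raw weights, so
# re-estimate them with the averaged (SWA) weights
update_bn(train_loader, swa_model)

After that, swa_model holds the averaged weights and can be used for prediction like any other module.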

The authors also have an extension called SWA-Gaussian (SWAG) that allows one to carry out uncertainty quantification by estimating the first two moments of the weight distribution, again based on snapshots from the exploration phase.

Installing pyomo

As is often the case, the easiest way to install it is to use anaconda. This is even documented in the pyomo documentation.
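For the record, the conda command is along these lines (using the conda-forge channel):

conda install -c conda-forge pyomo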

Tracking notebooks with git

The main challenge of tracking notebooks with git is that every time you re-generate an output, git will track it. That makes each commit large and hard to review. Fortunately, there is a script that removes the output prior to your commit. I'm using nbstripout. You can turn it on by running, inside your git repo,

nbstripout --install

and turn it off by doing

nbstripout --uninstall

It works great. Unfortunately, it doesn't behave super well when switching to other branches where nbstripout was not applied; basically, it removes the output from every notebook it finds, and then requires you to git commit before you can switch out of that branch. Not great. So instead, I apply it manually to the notebooks that I want to commit. You can do so by typing

nbstripout FILE.ipynb

Note on Brownian motion

I just want to summarize a few results that I find useful when dealing with arithmetic and geometric Brownian motions. For simplicity, I use unit time steps. If that is not the case (e.g., a daily rate of change observed hourly), then some minor adjustments need to be made.

Arithmetic Brownian motion

The SDE for the ABM is given by \(dI_t = \mu I_0 dt + \sigma I_0 dW_t\)

The solution is then \(I_d = I_0 (1 + \mu d + \sigma W_d)\) where $W_d$ is a Brownian motion. This means the following moments for the quantity $I_d$,

\[\begin{aligned} \mathbb{E}[I_d] & = I_0 (1 + \mu d) \\ \text{Var}[I_d] & = \sigma^2 I_0^2 d \\ \text{Cov}[I_d, I_{d'}] & = \sigma^2 I_0^2 \min(d,d') \\ \end{aligned}\]
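As a quick sanity check (with made-up parameter values), one can simulate the ABM over unit time steps and compare the empirical moments to the formulas above:

import numpy as np

rng = np.random.default_rng(0)
I0, mu, sigma, d = 100.0, 0.05, 0.2, 10
n_paths = 100_000

# W_d as the sum of d independent N(0, 1) increments (unit time steps)
W_d = rng.standard_normal((n_paths, d)).sum(axis=1)
I_d = I0 * (1 + mu * d + sigma * W_d)

print(I_d.mean(), I0 * (1 + mu * d))     # ~ E[I_d]
print(I_d.var(), sigma**2 * I0**2 * d)   # ~ Var[I_d]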

Geometric Brownian motion

You can find some info on Wikipedia, here and here. The SDE for GBM is \(dI_t = \mu I_t dt + \sigma I_t dW_t\). Then the solution is \(I_d = I_0 \exp \left( (\mu - \sigma^2/2)d + \sigma W_d \right)\). This means the following moments for the quantity $I_d$,

\[\begin{aligned} \mathbb{E}[I_d] & = I_0 e^{\mu d} \\ \text{Var}[I_d] & = I_0^2 e^{2\mu d}(e^{\sigma^2 d} - 1) \\ \text{Cov}[I_d, I_{d'}] & = I_0^2 e^{\mu(d+d')} \left( e^{\sigma^2 \min(d,d')} -1 \right) \end{aligned}\]

For the covariance, you can derive it as

\[\begin{aligned} \text{Cov}[I_d, I_{d'}] & = \mathbb{E}[I_d I_{d'}] - \mathbb{E}[I_d] \mathbb{E}[I_{d'}] \\ & = I_0^2 e^{(\mu-\sigma^2/2)(d+d')} \mathbb{E}[e^{\sigma(W_d + W_{d'})}] - I_0^2 e^{\mu(d+d')} \end{aligned}\]

And the last expectation is $\mathbb{E}[e^Y]$ where $Y = \sigma(W_d + W_{d'})$ is Gaussian, $Y \sim \mathcal{N}(0, \sigma^2(d+d'+2\min(d,d')))$, so that $e^Y$ is log-normal and $\mathbb{E}[e^Y] = \exp\left[\frac{\sigma^2}{2}\left(d+d'+2\min(d,d')\right)\right]$, and the result follows.

It is also useful to express the parameters $\mu$ and $\sigma$ in terms of the empirical moments of the quantity $I_d$. This gives

\[\begin{aligned} \mu & = \frac1d \log \frac{\mathbb{E}[I_d]}{I_0} \\ \sigma^2 & = \frac1d \log \left( 1 + \frac{\text{Var}[I_d]}{I_0^2 e^{2 \mu d}} \right) \end{aligned}\]
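As a quick numerical check (again with made-up parameter values), one can simulate GBM paths, compute the empirical mean and variance of $I_d$, and recover $\mu$ and $\sigma^2$ from the expressions above:

import numpy as np

rng = np.random.default_rng(0)
I0, mu, sigma, d = 100.0, 0.05, 0.2, 10
n_paths = 500_000

W_d = rng.standard_normal((n_paths, d)).sum(axis=1)
I_d = I0 * np.exp((mu - sigma**2 / 2) * d + sigma * W_d)

m, v = I_d.mean(), I_d.var()
mu_hat = np.log(m / I0) / d
sigma2_hat = np.log(1 + v / (I0**2 * np.exp(2 * mu_hat * d))) / d
print(mu_hat, sigma2_hat)   # should be close to 0.05 and 0.04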