Bro, do you even backprop?

The goal of this post is to dive deeper into the derivation of the backpropagation algorithm. This entry can be seen as a sub-post of the main post discussing a Python implementation of an artificial neural network, and all notation is exactly the same. For conventions and definitions about vector/matrix calculus, see that post.

The loss function is defined as the average of individual contributions for each data point $i$,

\[\mathcal{L}(\{ {\bf W}^l, {\bf b}^l \}_l) = \frac1N \sum_{i=1}^N c(\{ {\bf W}^l, {\bf b}^l \}_l, {\bf x}_i, {\bf y}_i)\]

For examples of loss functions, have a look at that post. In this post, I’m going to focus only on how to calculate the derivatives \(\frac{\partial c({\bf x}_i)}{\partial {\bf b}^l}\) and \(\frac{\partial c({\bf x}_i)}{\partial {\bf W}^l}\).

Derivatives with respect to ${\bf b}^l$

Applying the chain-rule, we get that

\[\frac{\partial c({\bf x}_i)}{\partial {\bf b}^l} = \frac{\partial {\bf a}^L}{\partial {\bf b}^l} \cdotp \frac{\partial c({\bf x}_i)}{\partial {\bf a}^L}\]

For the output layer, i.e., $l=L$, since ${\bf a}^L = f({\bf z}^L)$ with ${\bf z}^L = {\bf W}^L \cdotp {\bf a}^{L-1} + {\bf b}^L$, applying the chain-rule again with $z^L_k$, we have

\[\frac{\partial {\bf a}^L}{\partial b^L_j} = \sum_{k=1}^{n_L} \frac{\partial {\bf a}^L}{\partial z_k^L} \frac{\partial z_k^L}{\partial b^L_j}\]

Let ${\bf e}_k = [\dots,0,1,0,\dots]^T$ be the vector filled with all zeros but for one 1 at the $k^\text{th}$ entry. Then, we have

\[\frac{\partial {\bf a}^L}{\partial z_k^L} = f'(z_k^L) {\bf e}_k, \quad \frac{\partial z_k^L}{\partial b^L_j} = \delta_{jk}\]

This means

\[\frac{\partial {\bf a}^L}{\partial b^L_j} = f'(z_j^L) {\bf e}_j\]

Following the convention defined here, we’ll stack these vectors row by row to get

\[\frac{\partial {\bf a}^L({\bf x}_i)}{\partial {\bf b}^L} = \begin{bmatrix} f'( z_1^L({\bf x}_i) ) & & 0 \\ & \ddots & \\ 0 & & f'(z_{n_L}^L({\bf x}_i)) \end{bmatrix}\]

In the end, with \(z_k^L({\bf x}_i) = ({\bf w}^L_k)^T \cdotp {\bf a}^{L-1}({\bf x}_i) + b^L_k\), we get

\[\frac{\partial c({\bf x}_i)}{\partial {\bf b}^L} = \begin{bmatrix} f'( z_1^L({\bf x}_i) ) & & 0 \\ & \ddots & \\ 0 & & f'(z_{n_L}^L({\bf x}_i)) \end{bmatrix} \cdotp \frac{\partial c({\bf x}_i)}{\partial {\bf a}^L}\]

Using the results from the post on loss functions, we can show actual examples. For a least-square loss function, we have \(\frac{\partial c({\bf x}_i)}{\partial {\bf a}^L} = ({\bf a}^L({\bf x}_i) - {\bf y}_i)\). Whereas in the case of a multi-class classifier with a softmax output layer and a cross-entropy loss (which is not actually covered by the above formula, since softmax is not applied element-wise), we would have \(\frac{\partial c({\bf x}_i)}{\partial {\bf a}^L} = -\frac{1}{a^L({\bf x}_i)_{y_i}} {\bf e}_{y_i}\).

For an interior layer $l$, we need to write an induction between successive layers. But first, let’s make a few observations. In the derivatives, the variables ${\bf z}^l$ and ${\bf b}^l$ are interchangeable. Indeed, since \(\frac{\partial {\bf z}^l}{\partial {\bf b}^l} = I_{n_l}\), we have

\[\frac{\partial \cdotp}{\partial {\bf b}^l} = \frac{\partial {\bf z}^l}{\partial {\bf b}^l} \frac{\partial \cdotp}{\partial {\bf z}^l} = \frac{\partial \cdotp}{\partial {\bf z}^l}\]

Next, the result for \(\frac{\partial {\bf a}^L}{\partial {\bf b}^L}\) derived above can be generalized to any layer $l$, i.e.,

\[\frac{\partial {\bf a}^l({\bf x}_i)}{\partial {\bf b}^l} = \begin{bmatrix} f'( z_1^l({\bf x}_i) ) & & 0 \\ & \ddots & \\ 0 & & f'(z_{n_l}^l({\bf x}_i)) \end{bmatrix}\]

We now apply the chain-rule,

\[\begin{align*} \frac{\partial c({\bf x}_i)}{\partial {\bf b}^l} & = \frac{\partial c({\bf x}_i)}{\partial {\bf z}^l} \\ & = \frac{\partial {\bf a}^l}{\partial {\bf z}^l} \frac{\partial {\bf z}^{l+1}}{\partial {\bf a}^l} \frac{\partial c({\bf x}_i)}{\partial {\bf z}^{l+1}} \\ & = \frac{\partial {\bf a}^l}{\partial {\bf b}^l} \frac{\partial {\bf z}^{l+1}}{\partial {\bf a}^l} \frac{\partial c({\bf x}_i)}{\partial {\bf b}^{l+1}} \end{align*}\]

And in the end,

\[\frac{\partial c({\bf x}_i)}{\partial {\bf b}^l} = \begin{bmatrix} f'( z_1^l({\bf x}_i) ) & & 0 \\ & \ddots & \\ 0 & & f'(z_{n_l}^l({\bf x}_i)) \end{bmatrix} \cdotp ({\bf W}^{l+1})^T \cdotp \frac{\partial c({\bf x}_i)}{\partial {\bf b}^{l+1}}\]

Derivatives with respect to ${\bf W}^l$

I am going to re-use the results of the previous section. In fact, this is yet another application of the chain-rule, this time to ${\bf w}_j^l$, the $j^\text{th}$ row of ${\bf W}^l$ (using the index $j$ to avoid a clash with the data-point index $i$),

\[\begin{align*} \frac{\partial c({\bf x}_i)}{\partial {\bf w}_j^l} & = \frac{\partial {\bf z}^l}{\partial {\bf w}_j^l} \frac{\partial c({\bf x}_i)}{\partial {\bf z}^l} \\ & = \begin{bmatrix} \dots & 0 & {\bf {a}}^{l-1}({\bf x}_i) & 0 & \dots \end{bmatrix} \cdotp \frac{\partial c({\bf x}_i)}{\partial {\bf b}^l} \\ & = {\bf a}^{l-1}({\bf x}_i) \frac{\partial c({\bf x}_i)}{\partial b^l_j} \end{align*}\]

Since \({\bf W}^l = [{\bf w}_1^l,\dots,{\bf w}_{n_l}^l]^T\), we stack all the \(\frac{\partial c({\bf x}_i)}{\partial {\bf w}_j^l}\) as columns and then take the transpose to get

\[\frac{\partial c({\bf x}_i)}{\partial {\bf W}^l} = \frac{\partial c({\bf x}_i)}{\partial {\bf b}^l} \cdotp ({\bf a}^{l-1}({\bf x}_i))^T\]
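To make this concrete, here is a minimal NumPy sketch of the resulting algorithm. This is not the implementation from the main post: all names are mine, the activation $f$ is assumed element-wise, and the caller supplies a function for \(\frac{\partial c}{\partial {\bf a}^L}\).

```python
import numpy as np

def backprop(W, b, x, grad_c_aL, f, fprime):
    """Gradients of c w.r.t. all W^l and b^l for a single data point x.

    W, b      : lists of weight matrices / bias vectors for layers 1..L
    grad_c_aL : function a^L -> dc/da^L (e.g. a^L - y for the least-square loss)
    f, fprime : element-wise activation and its derivative
    """
    # Forward pass: store all z^l and a^l (needed by the backward pass).
    a, zs, activations = x, [], [x]
    for Wl, bl in zip(W, b):
        z = Wl @ a + bl
        a = f(z)
        zs.append(z)
        activations.append(a)

    L = len(W)
    grad_b, grad_W = [None] * L, [None] * L

    # Output layer: dc/db^L = diag(f'(z^L)) . dc/da^L
    delta = fprime(zs[-1]) * grad_c_aL(activations[-1])
    grad_b[-1] = delta
    grad_W[-1] = np.outer(delta, activations[-2])  # dc/db^L . (a^{L-1})^T

    # Interior layers: dc/db^l = diag(f'(z^l)) . (W^{l+1})^T . dc/db^{l+1}
    for l in range(L - 2, -1, -1):
        delta = fprime(zs[l]) * (W[l + 1].T @ delta)
        grad_b[l] = delta
        grad_W[l] = np.outer(delta, activations[l])
    return grad_W, grad_b
```

For instance, `f=np.tanh`, `fprime=lambda z: 1 - np.tanh(z)**2`, and `grad_c_aL=lambda aL: aL - y` reproduce the least-square example above.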

Comments

Dead neurons

It is informative to investigate what can cause the gradient of the cost functional to be zero. Note that the observations below apply to the gradient contribution of a single data point.

(1) That will be the case if the next layer is dead (all its gradients evaluate to zero), because of the back-propagation recursion. So once you have a dead layer (all neurons in that layer are dead), backpropagation effectively stops at that layer.

(2) That will also be the case if $f'(z^l_j) = 0$. With ReLU, this happens for negative values of ${\bf z}^l$; for sigmoid / tanh, this happens for very large values (either negative or positive).

(3) Neuron $j$ in layer $l$ will be dead if the $j^\text{th}$ column of ${\bf W}^{l+1}$ is zero.

Parallelism

The backpropagation algorithm is inherently sequential, which makes it challenging to parallelize beyond the embarrassingly parallel sum over the data points. This is actually an advantage for batch algorithms, which require all $N$ points, over stochastic algorithms like stochastic gradient descent, which compute the derivative for a single data point $i$.

Checkpointing

To compute the derivatives, you need to store all the \({\bf a}^l({\bf x}_i)\) computed during the forward propagation. I wonder whether memory storage could become an issue, in which case a checkpointing algorithm would be a natural solution.

Higher-order derivatives

The next step is to look at how you can compute the action of a given vector on the Hessian of the loss function.

Statistical Hypothesis Tests

The goal of this post is to summarize a few important statistical hypothesis tests that come up often in practice.

t-tests

The first class of tests we are going to look at is t-tests, i.e., tests that involve a statistic following a Student’s t distribution.

One sample t-test

The simplest t-test, the one-sample t-test, can be used to check whether the sample mean is equal to a certain value when all samples \(\{ X_i \}_i\) are drawn from distributions that have the same mean $\mu$ and variance $\sigma^2$. It is a variation of the z-test for when the variance of the population is unknown and needs to be estimated from the sample. Indeed, from the central limit theorem, we know the sample mean, $\bar{X} = \frac1n \sum_{i=1}^n X_i$, tends to a normal distribution as the number of samples increases, i.e., in the limit of a large number of samples, \[ \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1) \] Clearly we have $\mathbb{E}[\bar{X}] = \mu$, and if all samples are independent, $Var[\bar{X}] = \sigma^2/n$. However, if we do not know $\sigma^2$ a priori and use instead the sample variance \(s^2 = \frac1{n-1} \sum_{i=1}^n (X_i - \bar{X})^2\), the distribution becomes \[ \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t(n-1) \]
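As a quick sanity check, here is how the statistic above can be computed by hand and with scipy (synthetic data, all numbers made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=50)  # synthetic sample
mu0 = 0.0                                    # null hypothesis: mean = 0

# By hand: t = (xbar - mu0) / (s / sqrt(n)), with s the sample std (ddof=1)
t_manual = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))

# scipy equivalent
t_scipy, p_value = stats.ttest_1samp(x, popmean=mu0)
print(t_manual, t_scipy, p_value)  # t_manual == t_scipy
```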

To apply a one-sample t-test to the parameters of a linear regression, the variance is not scaled by the square root of the number of samples; the standard error is calculated directly from the regression. For an OLS, $Y = X \cdotp \beta + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, the parameters are estimated by $b = (X^T \cdotp X)^{-1} (X^T \cdotp Y) = \beta + (X^T \cdotp X)^{-1} X^T \varepsilon$. The variance of the estimate $b$ is then $Var[b] = \sigma^2 (X^T \cdotp X)^{-1}$. Since $\sigma^2$ is unknown, we use instead the sample variance $s^2 = \frac1{n-p} \sum_{i=1}^n (y_i - \hat{y}_i)^2$, where $\hat{y}_i$ is the estimate for $y_i$ and $p$ is the number of parameters; the resulting statistic for each coefficient follows a $t(n-p)$ distribution.
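Here is a NumPy sketch of these formulas on made-up data (the design matrix, coefficients, and noise level are all arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 regressors
beta = np.array([1.0, 2.0, 0.0])
y = X @ beta + rng.normal(scale=0.5, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimate (X^T X)^{-1} X^T y
resid = y - X @ b
s2 = resid @ resid / (n - p)            # residual variance, n - p degrees of freedom
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))  # standard errors of b
t_stats = b / se                        # H0: beta_j = 0
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p)
print(t_stats, p_values)
```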

Independent two sample t-test

We can use an independent two-sample t-test to compare the population means of two populations when the samples of the two populations do not have a clear connection with each other. We’ll look at the case of the dependent two-sample t-test afterward. Again, we typically do not know the variance of each population and must resort to the sample variance, which leads to a Student’s t distribution. The statistic we use here is \[ \frac{ \bar{X}_1 - \bar{X}_2}{s_d} \sim t(df) \] where $s_d$ is the standard error of the difference and $df$ is the number of degrees of freedom. The correct values to use depend on the situation.

If the two populations have the same variance, we can use a pooled sample variance $s_p$ to compute $s_d$, i.e., \[ s_d = s_p \sqrt{1/n_1 + 1/n_2}, \] where \(s_p^2 = (\sum_i (n_i-1) s_i^2) / (\sum_j (n_j-1))\), and \[ df = n_1 + n_2 - 2 \]

If the two populations have different variances (and, in the general case, different numbers of samples), then \[ s_d = \sqrt{s_1^2/n_1 + s_2^2/n_2} . \] The exact distribution in that case is a mess, but for all practical purposes it can be approximated by a Student’s t distribution with degrees of freedom \[ df = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{(s_1^2/n_1)^2/(n_1-1) + (s_2^2/n_2)^2/(n_2-1)} \]

A dependent t-test for paired samples is used when the two samples are related to each other, for instance, when they are the same patients before and after a treatment. In that case, the solution is to test the differences of the pairs with a one-sample t-test.
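Both two-sample variants, and the paired test, are available in scipy (synthetic data again):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x1 = rng.normal(0.0, 1.0, size=40)
x2 = rng.normal(0.5, 1.5, size=55)

# Pooled-variance test (assumes equal variances) vs Welch's test
print(stats.ttest_ind(x1, x2, equal_var=True))
print(stats.ttest_ind(x1, x2, equal_var=False))

# Paired samples: equivalent to a one-sample t-test on the differences
before = rng.normal(1.0, 1.0, size=30)
after = before + rng.normal(0.2, 0.5, size=30)
print(stats.ttest_rel(before, after))
print(stats.ttest_1samp(before - after, popmean=0.0))  # same statistic
```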

Chi-square tests

There are different flavours of chi-square tests. I’m here looking at the goodness-of-fit test, i.e., testing whether a categorical population is distributed according to a theoretical distribution (the null hypothesis). In that case, Pearson’s chi-square test is actually an approximation to the G-test, \[ G = 2 \sum_i O_i \ln (O_i/E_i) \sim \chi^2(df) \] where $O_i$ is the observed count for category $i$ and $E_i$ is the expected count for category $i$ (under the null). The degrees of freedom is $df = $ number of categories $-$ (number of estimated parameters in the distribution + 1); for instance, the number of parameters for a uniform distribution is 0, for a normal distribution it is 2,$\dots$

How do we get to the chi-square test from the G-test? We use the expansion $\ln(1+u) \approx u - u^2/2$ when $|u|\ll 1$. If we assume that $O_i$ and $E_i$ are close to each other, then \[ O_i \ln(E_i/O_i) = O_i \ln \left( 1 + \frac{E_i-O_i}{O_i} \right) \approx (E_i - O_i) - \frac{(E_i-O_i)^2}{2O_i} \] Going back to the G statistic, and using the fact that $\sum_i E_i = \sum_i O_i$, we have \[ G = -2 \sum_i O_i \ln (E_i/O_i) \approx \sum_i \frac{(O_i-E_i)^2}{O_i} \] which is the chi-square statistic (Pearson’s version has $E_i$ in the denominator, but since $O_i \approx E_i$ the two agree to the same order).
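Both statistics are available in scipy; `power_divergence` with the log-likelihood option computes the G statistic (the counts below are made up, with a uniform null):

```python
import numpy as np
from scipy import stats

observed = np.array([18, 22, 29, 31])      # observed counts per category
expected = np.full(4, observed.sum() / 4)  # uniform null: 0 estimated parameters

# Pearson's chi-square: sum (O - E)^2 / E, df = 4 - 1
print(stats.chisquare(observed, f_exp=expected))

# G-test: 2 * sum O * ln(O / E), same asymptotic chi-square distribution
print(stats.power_divergence(observed, f_exp=expected, lambda_="log-likelihood"))
```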

ANOVA test

I won’t go into much detail. The ANOVA test can be seen as a generalization of the two-sample t-test to the case of multiple populations. There is also a whole chapter dedicated to it in Casella & Berger’s Statistical Inference textbook.

Ljung-Box test

The Ljung-Box test is a portmanteau test for the correlation of a time-series. It tests whether the first $h$ auto-correlations of a time-series look like white noise (the null) or exhibit serial correlation. We define the statistic \[ Q = n (n+2) \sum_{k=1}^h \frac{\hat{\rho}_k^2}{n-k} \sim \chi^2(h) \] where $\hat{\rho}_k$ is the lag-$k$ sample autocorrelation. Note that this is a one-sided test.
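The statistic is straightforward to compute directly from its definition; here is a small sketch (statsmodels also provides this test as `acorr_ljungbox`):

```python
import numpy as np
from scipy import stats

def ljung_box(x, h):
    """Ljung-Box Q statistic and p-value for lags 1..h."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = xc @ xc
    # Lag-k sample autocorrelations, k = 1..h
    rho = np.array([(xc[k:] @ xc[:-k]) / denom for k in range(1, h + 1)])
    q = n * (n + 2) * np.sum(rho**2 / (n - np.arange(1, h + 1)))
    return q, stats.chi2.sf(q, df=h)  # one-sided: reject for large Q

rng = np.random.default_rng(3)
print(ljung_box(rng.normal(size=500), h=10))  # white noise: large p-value expected
```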

Unit root test

If a time-series has a unit root, it is, among other things, non-stationary. For instance, we can test for the presence of a unit root to decide whether a time-series needs to be differenced or not. The Dickey-Fuller test is a popular test for unit roots.
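statsmodels implements the augmented version of the test; a quick sketch on synthetic series (a random walk has a unit root, white noise does not):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(4)
random_walk = np.cumsum(rng.normal(size=500))  # unit root: should fail to reject H0
stationary = rng.normal(size=500)              # no unit root: should reject H0

for series in (random_walk, stationary):
    stat, pvalue, *_ = adfuller(series)        # H0: the series has a unit root
    print(stat, pvalue)
```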

Conventions for Matrix Calculus

Notes

In optimization, you need to take derivatives. To do that in $\mathbb{R}^n$, you need matrix calculus. The objective of this note is to summarize the important definitions and conventions, explain them whenever possible, and show a few examples.

Frechet and Gateaux derivatives

Let’s define a function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$. That function is Frechet differentiable at $x$ if there exists a bounded (necessarily the case for an operator between finite dimensional spaces) linear operator
$Df(x): \mathbb{R}^n \rightarrow \mathbb{R}^m$ such that \[ \frac{| f(x+h) - f(x) - Df(x)h |}{| h |} \rightarrow 0 \] as $|h| \rightarrow 0$.

A somewhat weaker notion of differentiability is Gateaux differentiability. A function $f$ is Gateaux differentiable at $x$ if, for any $v \in \mathbb{R}^n$, the following limit exists \[ \lim_{\varepsilon \rightarrow 0} \frac{f(x + \varepsilon v) - f(x)}{\varepsilon} . \] If that limit exists, we can calculate it as \[ \lim_{\varepsilon \rightarrow 0} \frac{f(x + \varepsilon v) - f(x)}{\varepsilon} = \left. \frac{d}{d\varepsilon} f(x + \varepsilon v) \right|_{\varepsilon=0} . \]

If $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$ is Frechet differentiable, it is necessarily Gateaux differentiable, and \(\frac{d}{d \varepsilon} f(x + \varepsilon v) |_{\varepsilon=0} = Df(x) v\).
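This relationship is easy to check numerically with finite differences; below is a small NumPy sketch with a test function I chose arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(4, 3))

def f(x):
    # A smooth test function from R^3 to R^4
    return np.tanh(A @ x)

def jacobian(x):
    # Closed form: J(x) = diag(1 - tanh(Ax)^2) . A
    return (1 - np.tanh(A @ x) ** 2)[:, None] * A

x, v = rng.normal(size=3), rng.normal(size=3)
eps = 1e-6
gateaux = (f(x + eps * v) - f(x)) / eps           # finite-difference directional derivative
print(np.max(np.abs(gateaux - jacobian(x) @ v)))  # small, O(eps)
```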

Gradient of a functional

Let’s call $H = \mathbb{R}^n$, which is a Hilbert space with inner product $\langle \cdotp, \cdotp \rangle$. For a functional $f: \mathbb{R}^n \rightarrow \mathbb{R}$, the derivative $Df(x)$ is, by definition, an element of the dual space $H^*$. Applying the Riesz representation theorem, we know there is an element $g_x \in H$ such that, for any $v \in H$, \(Df(x) v = \langle g_x, v \rangle\). That element $g_x$ is the gradient of the functional $f$. This cleanly defines the gradient of a functional, without having to agree on notations or conventions.

General case

What can we do for a function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$? First, we can’t apply the Riesz representation theorem. Also, it is not clear what it would mean to optimize such a function $f$; we’d need to define a total order on $\mathbb{R}^m$ that coincides with the objective of the optimization. For that reason, I see the definition of a gradient in this case as more of a convention. There are really two conventions, which are transposes of each other (see layout conventions), and I adopt the one used in Nocedal & Wright’s Numerical Optimization textbook (section A.2, Derivatives). Linear maps between two finite-dimensional spaces can all be described by the action of a matrix. Nocedal & Wright call the Jacobian the matrix $J(x) \in \mathbb{R}^{m \times n}$ that verifies, for any $v \in \mathbb{R}^n$, $Df(x)v = J(x) \cdotp v$. The gradient is defined to be its transpose,

\[\begin{align} J(x) & = \left[ \frac{\partial f_i}{\partial x_j} \right]_{ij} \in \mathbb{R}^{m \times n} \\ \nabla f(x) & = \left[ \frac{\partial f_j}{\partial x_i} \right]_{ij} = J(x)^T = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \dots & \frac{\partial f_m}{\partial x_1} \\ \vdots & & \vdots \\ \frac{\partial f_1}{\partial x_n} & \dots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{n \times m} \end{align}\]

One thing to be careful about with this notation: the chain-rule, as we know it, applies to the Jacobian, i.e., if $h(x) = f(g(x))$, then

\[J_h(x) = J_f(g(x)) \cdotp J_g(x)\]

and therefore in terms of the gradient, we get the transpose,

\[\nabla h(x) = \nabla g(x) \cdotp \nabla f(g(x))\]

For instance, let’s assume $f: \mathbb{R}^m \rightarrow \mathbb{R}$ and $g: \mathbb{R}^n \rightarrow \mathbb{R}^m$. Then, with $y_k = (g(x))_k$, for any $i=1,\dots,n$,

\[\frac{\partial h}{\partial x_i} = \sum_{k=1}^m \frac{\partial f}{\partial y_k} \frac{\partial y_k}{\partial x_i} = \left[ \frac{\partial (g(x))_k}{\partial x_i} \right]_k^T \cdotp \nabla f(g(x))\]

Then putting all indices $i$ together (in rows), we get the expression above for the gradient.

Derivative with respect to a matrix

In that case also, this is just a convenient notation. For a function $f : \mathbb{R}^{m \times n} \rightarrow \mathbb{R}$, we define

\[\frac{\partial f(M)}{\partial M} = \left[ \frac{\partial f(M)}{\partial m_{ij}} \right]_{ij}\]

Examples

If $f(x) = Ax + b$

We then have $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$. We can apply the definition of the Gateaux derivative,

\[f(x + \varepsilon v) = f(x) + \varepsilon A v\]

We can conclude directly that

\[\nabla f(x) = A^T\]

If $f(x) = \frac12 x^T Q x $

We then have $f : \mathbb{R}^n \rightarrow \mathbb{R}$. We can apply the definition of the Gateaux derivative,

\[f(x + \varepsilon v) = f(x) + \frac12 \varepsilon \left( x^T Q v + v^T Q x \right) + \frac12 \varepsilon^2 v^T Q v\]

We conclude that

\[\nabla f(x) = \frac12 ( Q + Q^T) x\]

In the special case that $Q=Q^T$ (symmetric), we have

\[\nabla f(x) = Q x\]

If $f(x) = \frac12 \| Ax+b \|^2$

We then have $f : \mathbb{R}^n \rightarrow \mathbb{R}$. Following the same approach, we get

\[f(x + \varepsilon v) = f(x) + \frac12 \varepsilon \left( (Ax+b)^T A v + (Av)^T(Ax+b) \right) + \frac12 \varepsilon^2 \|Av\|^2\]

Since \(x^T \cdotp v = v^T \cdotp x\), we can conclude that

\[\nabla f(x) = A^T \cdotp (Ax+b)\]

Alternatively, we can use the chain-rule with $g(y) = \frac12 \| y \|^2$ and $f(x) = g(Ax + b)$. Since $\nabla (Ax+b) = A^T$ and $\nabla g(y) = y$, we have

\[\nabla f(x) = A^T \cdotp (Ax+b)\]
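As a sanity check, this closed-form gradient can be compared against finite differences (random data, a sketch only):

```python
import numpy as np

rng = np.random.default_rng(6)
A, b = rng.normal(size=(5, 3)), rng.normal(size=5)
f = lambda x: 0.5 * np.sum((A @ x + b) ** 2)

x = rng.normal(size=3)
grad = A.T @ (A @ x + b)  # closed-form gradient derived above

# Central finite differences along the basis vectors
eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(3)])
print(np.max(np.abs(fd - grad)))  # tiny, O(eps^2)
```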

If $f(A) = Ax+b$

In that case, I’m not sure it helps to talk about a gradient. However we can still calculate the derivative (e.g., using the formula for the Gateaux derivative), and we get

\[Df(A) M = M \cdotp x\]

If $f(A) = \frac12 \| Ax+b \|^2$

Here we have $f: \mathbb{R}^{m \times n} \rightarrow \mathbb{R}$. It’s tempting to use the chain-rule, but I couldn’t settle on a logical convention, and sadly I think this is the main conclusion of this small post: it’s safer to always rely on the entry-wise derivative. In this case, with $y = Ax + b \in \mathbb{R}^m$ and $g(y) = \frac12 \| y \|^2$, we have

\[\begin{align*} \frac{\partial f(A)}{\partial a_{ij}} & = \sum_k \frac{\partial g(Ax + b)}{\partial y_k} \frac{\partial y_k}{\partial a_{ij}} \\ & = \left[ \frac{\partial y_1}{\partial a_{ij}}, \dots, \frac{\partial y_m}{\partial a_{ij}} \right] \cdotp (Ax + b) \end{align*}\]

Let’s look at the partial derivatives for $y$, using the notation $\delta_{ik} = 1$ if $i=k$ and $0$ otherwise,

\[\frac{\partial y_k}{\partial a_{ij}} = x_j \delta_{ik}\]

Such that

\[\frac{\partial f(A)}{\partial a_{ij}} = (Ax+b)_i x_j\]

And putting all indices back together,

\[\frac{\partial f(A)}{\partial A} = (Ax+b) \cdotp x^T\]
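The same kind of entry-wise finite-difference check confirms this formula (again a quick sketch on random data):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 4, 3
A, x, b = rng.normal(size=(m, n)), rng.normal(size=n), rng.normal(size=m)
f = lambda A_: 0.5 * np.sum((A_ @ x + b) ** 2)

grad = np.outer(A @ x + b, x)  # (Ax + b) . x^T, derived above

# Perturb each entry a_ij in turn
eps, fd = 1e-6, np.zeros((m, n))
for i in range(m):
    for j in range(n):
        E = np.zeros((m, n))
        E[i, j] = eps
        fd[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
print(np.max(np.abs(fd - grad)))  # tiny, O(eps^2)
```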

Using Font Awesome on your webpage

Font Awesome is an easy way to add cool icons to your website, like the ones for GitHub, LinkedIn, or just email (see the side bar). To use it, all you need to do is add

<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css">

to the header of your HTML file. Then choose the icon you want to use. You can modify the size of the icons by adding the appropriate class. For instance, for a large GitHub icon, you would add

<i class="fa fa-github fa-lg"></i>

Market-cap weighted portfolios are self-rebalancing

This is a very simple note about a proof that I regularly forget. It is often said that a great advantage of market-cap weighted portfolios is that they are self-rebalancing. That is, when you choose the weight of each security in a portfolio according to the proportion of its market-cap, the weights remain consistent with the original definition over time (under some rather mild assumptions).

Return of a portfolio

First, an even simpler result that we need in the rest of this note: the return of a portfolio is simply the weighted average of the returns of the securities in the portfolio. Let’s introduce some notation: call $X_t$ the \$-value of the portfolio at time $t$, with $X_t = \sum_i X_{i,t}$, where $X_{i,t} = w_{i,t} X_t$ is the \$-value of the $i^\text{th}$ security in the portfolio at time $t$. The sum is taken over all securities in the portfolio, and we assume the weights sum to 1. Next, the return of the portfolio (and similarly of any security $i$) is defined by $r_t = X_t/X_{t-1}-1$. Then, \(\begin{align} X_t & = \sum_i X_{i,t} = \sum_i (1+r_{i,t}) X_{i,t-1} = \sum_i (1+r_{i,t}) w_{i,t-1} X_{t-1} \notag \\ & = X_{t-1} \left( 1 + \sum_i w_{i,t-1} r_{i,t} \right), \end{align}\) since by definition $\sum_i w_{i,t} = 1$ at any time $t$. Note that the same conclusion applies to all types of returns, including continuously compounded returns \(\bar{r}_{i,t}\), since $1+r_{i,t} = e^{\bar{r}_{i,t}}$.

Self-rebalancing

Let’s define $M_{i,t}$ as the market-cap of security $i$, and let $M_t = \sum_i M_{i,t}$. Now let’s assume that in period $t-1$ the weights are calculated as a proportion of relative market-cap, i.e., $w_{i,t-1} = M_{i,t-1} / M_{t-1}$. And let’s assume that from period $t-1$ to $t$ the market-cap of every security $i$ varies only through its price (the number of shares remains constant, no M$\&$A,…). Then, the ratios of market-caps vary from $t-1$ to $t$ as \[ \frac{M_{i,t}/M_t}{M_{i,t-1}/M_{t-1}} = \frac{M_{i,t}}{M_{i,t-1}} \frac{M_{t-1}}{M_t} = \frac{1+r_{i,t}}{1+r_t} . \] On the other hand, if the portfolio is simply held (no trading), its weights vary as \[ \frac{w_{i,t}}{w_{i,t-1}} = \frac{X_{i,t}/X_t}{X_{i,t-1}/X_{t-1}} = \frac{X_{i,t}}{X_{i,t-1}} \frac{X_{t-1}}{X_t} = \frac{1+r_{i,t}}{1+r_t} . \] The two ratios are identical: a portfolio that starts market-cap weighted remains market-cap weighted without any trading.
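To convince yourself numerically, here is a tiny simulation sketch (all numbers made up): a buy-and-hold portfolio that starts market-cap weighted stays market-cap weighted, period after period, without any trading.

```python
import numpy as np

rng = np.random.default_rng(8)
n_assets, n_periods = 5, 12
caps = rng.uniform(1.0, 10.0, size=n_assets)  # initial market caps
weights = caps / caps.sum()                   # initial cap weights
holdings = weights / caps                     # buy-and-hold positions, never traded

for _ in range(n_periods):
    r = rng.normal(0.0, 0.05, size=n_assets)  # per-security returns
    caps *= 1 + r                             # caps vary through prices only
    portfolio_weights = holdings * caps / (holdings * caps).sum()
    market_weights = caps / caps.sum()
    assert np.allclose(portfolio_weights, market_weights)
print("still market-cap weighted after", n_periods, "periods")
```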