13 Nov 2018
The goal of this post is to dive deeper into the derivation of the backpropagation
algorithm. This entry can be seen as a sub-post of that main
post
discussing a Python implementation of an artificial neural network, and all
notations will be exactly the same. For conventions and definitions about
vector/matrix calculus, see that post.
The loss function is defined as the sum of individual contributions for each
data point $i$,
\[\mathcal{L}(\{ {\bf W}^l, {\bf b}^l \}_l) =
\frac1N \sum_{i=1}^N c(\{ {\bf W}^l, {\bf b}^l \}_l, {\bf x}_i, {\bf y}_i)\]
For examples of loss functions, have a look at that
post.
In this post, I’m going to focus only on how to calculate the derivatives
\(\frac{\partial c({\bf x}_i)}{\partial {\bf b}^l}\)
and
\(\frac{\partial c({\bf x}_i)}{\partial {\bf W}^l}\).
Derivatives with respect to ${\bf b}^l$
Applying the chain-rule, we get that
\[\frac{\partial c({\bf x}_i)}{\partial {\bf b}^l} =
\frac{\partial {\bf a}^L}{\partial {\bf b}^l} \cdotp
\frac{\partial c({\bf x}_i)}{\partial {\bf a}^L}\]
For the output layer, i.e., $l=L$, since ${\bf a}^L = f({\bf z}^L)$ with ${\bf
z}^L = {\bf W}^L \cdotp {\bf a}^{L-1} + {\bf b}^L$, applying the chain-rule again
with $z^L_k$, we have
\[\frac{\partial {\bf a}^L}{\partial b^L_j} =
\sum_{k=1}^{n_L} \frac{\partial {\bf a}^L}{\partial z_k^L}
\frac{\partial z_k^L}{\partial b^L_j}\]
Let ${\bf e}_k = [\dots,0,1,0,\dots]^T$ be the vector filled with all zeros
but for one 1 at the $k^\text{th}$ entry. Then, we have
\[\frac{\partial {\bf a}^L}{\partial z_k^L} = f'(z_k^L) {\bf e}_k,
\quad \frac{\partial z_k^L}{\partial b^L_j} = \delta_{jk}\]
This means
\[\frac{\partial {\bf a}^L}{\partial b^L_j} = f'(z_j^L) {\bf e}_j\]
Following the convention defined
here, we’ll stack these vectors row by
row to get
\[\frac{\partial {\bf a}^L({\bf x}_i)}{\partial {\bf b}^L} =
\begin{bmatrix}
f'( z_1^L({\bf x}_i) ) & & 0 \\
& \ddots & \\
0 & & f'(z_{n_L}^L({\bf x}_i))
\end{bmatrix}\]
In the end, with \(z_k^L({\bf x}_i) = ({\bf w}^L_k)^T \cdotp {\bf a}^{L-1}({\bf
x}_i) + b^L_k\), we
get
\[\frac{\partial c({\bf x}_i)}{\partial {\bf b}^L} =
\begin{bmatrix}
f'( z_1^L({\bf x}_i) ) & & 0 \\
& \ddots & \\
0 & & f'(z_{n_L}^L({\bf x}_i))
\end{bmatrix} \cdotp
\frac{\partial c({\bf x}_i)}{\partial {\bf a}^L}\]
Using the results from the post on loss functions, we can give
actual examples. For a least-squares loss function, we have
\(\frac{\partial c({\bf x}_i)}{\partial {\bf a}^L} = ({\bf a}^L({\bf x}_i) -
{\bf y}_i)\). Whereas in the case of a multi-class classifier with a softmax
output layer (which is not actually covered by the above formula), we would have
\(\frac{\partial c({\bf x}_i)}{\partial {\bf a}^L} = a^L({\bf x}_i)_{y_i} {\bf e}_{y_i}\).
For an interior layer $l$, we need to write an induction between successive
layers. But first, let’s make a few observations. In the derivatives, the
variables ${\bf z}^l$ and ${\bf b}^l$ are interchangeable. Indeed, since
\(\frac{\partial {\bf z}^l}{\partial {\bf b}^l} = I_{n_l}\), we have
\[\frac{\partial \cdotp}{\partial {\bf b}^l} =
\frac{\partial {\bf z}^l}{\partial {\bf b}^l}
\frac{\partial \cdotp}{\partial {\bf z}^l} = \frac{\partial \cdotp}{\partial
{\bf z}^l}\]
Next, the result for \(\frac{\partial {\bf a}^L}{\partial {\bf b}^L}\)
derived above
can be generalized to any layer $l$, i.e.,
\[\frac{\partial {\bf a}^l({\bf x}_i)}{\partial {\bf b}^l} =
\begin{bmatrix}
f'( z_1^l({\bf x}_i) ) & & 0 \\
& \ddots & \\
0 & & f'(z_{n_l}^l({\bf x}_i))
\end{bmatrix}\]
We now apply the chain-rule,
\[\begin{align*}
\frac{\partial c({\bf x}_i)}{\partial {\bf b}^l} & = \frac{\partial c({\bf x}_i)}{\partial {\bf z}^l} \\
& = \frac{\partial {\bf a}^l}{\partial {\bf z}^l}
\frac{\partial {\bf z}^{l+1}}{\partial {\bf a}^l}
\frac{\partial c({\bf x}_i)}{\partial {\bf z}^{l+1}} \\
& = \frac{\partial {\bf a}^l}{\partial {\bf b}^l}
\frac{\partial {\bf z}^{l+1}}{\partial {\bf a}^l}
\frac{\partial c({\bf x}_i)}{\partial {\bf b}^{l+1}}
\end{align*}\]
And in the end,
\[\frac{\partial c({\bf x}_i)}{\partial {\bf b}^l} =
\begin{bmatrix}
f'( z_1^l({\bf x}_i) ) & & 0 \\
& \ddots & \\
0 & & f'(z_{n_l}^l({\bf x}_i))
\end{bmatrix} \cdotp
({\bf W}^{l+1})^T \cdotp
\frac{\partial c({\bf x}_i)}{\partial {\bf b}^{l+1}}\]
Derivatives with respect to ${\bf W}^l$
I am going to re-use the results of the previous section. In fact, this is yet
another application of the chain-rule; writing ${\bf w}_j^l$ for the
$j^\text{th}$ row of ${\bf W}^l$,
\[\begin{align*}
\frac{\partial c({\bf x}_i)}{\partial {\bf w}_j^l} & =
\frac{\partial {\bf z}^l}{\partial {\bf w}_j^l}
\frac{\partial c({\bf x}_i)}{\partial {\bf z}^l} \\
& = \begin{bmatrix} \dots & 0 & {\bf {a}}^{l-1}({\bf x}_i) & 0 & \dots \end{bmatrix} \cdotp
\frac{\partial c({\bf x}_i)}{\partial {\bf b}^l} \\
& = {\bf a}^{l-1}({\bf x}_i) \frac{\partial c({\bf x}_i)}{\partial b^l_j}
\end{align*}\]
Since \({\bf W}^l = [{\bf w}_1^l,\dots,{\bf w}_{n_l}^l]^T\), we stack all
\(\frac{\partial c({\bf x}_i)}{\partial {\bf w}_j^l}\) in columns and then take the transpose
to get
\[\frac{\partial c({\bf x}_i)}{\partial {\bf W}^l} =
\frac{\partial c({\bf x}_i)}{\partial {\bf b}^l} \cdotp ({\bf a}^{l-1}({\bf x}_i))^T\]
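To make these formulas concrete, here is a minimal NumPy sketch of the backward pass, implementing the recursion for \(\frac{\partial c({\bf x}_i)}{\partial {\bf b}^l}\) and the outer-product formula for \(\frac{\partial c({\bf x}_i)}{\partial {\bf W}^l}\). The least-squares loss and the variable names are my own assumptions, not taken from the main post.

import numpy as np

def backprop(Ws, bs, x, y, f, fprime):
    # Sketch of the formulas above for a least-squares loss
    # c = 0.5 * ||a^L - y||^2, so that dc/da^L = a^L - y.
    # Ws, bs: lists of weight matrices W^l and bias vectors b^l.
    a, zs, activations = x, [], [x]
    for W, b in zip(Ws, bs):          # forward pass, storing z^l and a^l
        z = W @ a + b
        a = f(z)
        zs.append(z)
        activations.append(a)
    grad_W, grad_b = [None] * len(Ws), [None] * len(Ws)
    # Output layer: dc/db^L = diag(f'(z^L)) . dc/da^L.
    delta = fprime(zs[-1]) * (a - y)
    grad_b[-1] = delta
    grad_W[-1] = np.outer(delta, activations[-2])   # dc/dW^L = dc/db^L . (a^{L-1})^T
    # Recursion: dc/db^l = diag(f'(z^l)) . (W^{l+1})^T . dc/db^{l+1}.
    for l in range(len(Ws) - 2, -1, -1):
        delta = fprime(zs[l]) * (Ws[l + 1].T @ delta)
        grad_b[l] = delta
        grad_W[l] = np.outer(delta, activations[l])
    return grad_W, grad_b

A finite-difference gradient check is an easy way to validate such an implementation.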
Dead neurons
It is informative to investigate what can cause the gradient of the cost
functional to be zero. Note that this only applies to a single data point.
(1) This will be the case if the next layer is dead (all its gradients evaluate to
zero), because of the back-propagation recursion. So once you have a dead layer (all
neurons in that layer are dead), you can stop backpropagating past that layer.
(2) This will also be the case if $f'(z^l_j) = 0$. With ReLU, this happens
for negative values of ${\bf z}^l$, and for sigmoid / tanh, this happens for very large
values (either negative or positive).
(3) The same holds if a column of ${\bf W}^{l+1}$ is zero.
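As a quick, hypothetical illustration of case (2): with ReLU, the diagonal factor $f'({\bf z}^l)$ zeroes every component of the gradient whose pre-activation is negative.

import numpy as np

relu_prime = lambda z: (z > 0).astype(float)

z = np.array([-2.0, -0.5, 1.0])    # pre-activations z^l of three neurons
upstream = np.ones(3)              # stand-in for (W^{l+1})^T . dc/db^{l+1}
print(relu_prime(z) * upstream)    # [0. 0. 1.]: the first two neurons are dead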
Parallelism
The backpropagation algorithm is inherently sequential, which makes it
challenging to parallelize beyond the embarrassing parallelism of the sum over the
data points. This is actually an advantage for batch algorithms, which require all
$N$ points, over stochastic algorithms like stochastic gradient descent, which compute
the derivative for a single data point $i$.
Checkpointing
To compute the derivatives, you need to store all the \({\bf
a}^l({\bf x}_i)\) that were computed during the forward propagation.
I'm wondering if memory storage could become an issue; in that case, a checkpointing
algorithm (recomputing some activations during the backward pass instead of
storing them all) would be a natural solution.
Higher-order derivatives
The next step is to look at how to compute the action of the Hessian of the
loss function on a given vector.
12 Nov 2018
The goal of this post is to summarize a few important statistical hypothesis tests that come
up often in practice.
t-tests
The first class of tests we are going to look at are
t-tests, i.e., tests that
involve a statistic following a Student’s t
distribution.
One sample t-test
The simplest t-test, a one-sample
t-test,
can be used to check whether the population mean is equal to a certain value,
when all samples
\(\{ X_i \}_i\)
are drawn from distributions that have
the same mean $\mu$ and variance $\sigma^2$. This is a variation of the z-test for when the
variance of the population is unknown and needs to be estimated from the
sample. Indeed, from the central limit theorem, we know that the sample mean,
$\bar{X} = \frac1n \sum_{i=1}^n X_i$, tends (as the number of samples increases)
to a normal distribution, i.e., in the limit of a large number of samples,
\[ \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1) \]
Clearly we have $\mathbb{E}[\bar{X}] = \mu$, and if all samples are independent,
$Var[\bar{X}] = \sigma^2/n$.
However if we do not know $\sigma^2$ a priori, and use instead the sampling
variance
\(s^2 = \frac1{n-1} \sum_{i=1}^n (X_i - \bar{X})^2\), the distribution becomes
\[ \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t(n-1) \]
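As a minimal sketch (synthetic data, names are mine), the statistic can be computed by hand and checked against SciPy's ttest_1samp:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.2, scale=1.0, size=50)   # synthetic sample
mu0 = 0.0                                      # null hypothesis: mu = 0

# Manual statistic, compared against t(n-1):
t_stat = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))
p_val = 2 * stats.t.sf(abs(t_stat), df=len(x) - 1)

print(t_stat, p_val)
print(stats.ttest_1samp(x, popmean=mu0))       # same result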
To apply a one-sample t-test to the parameters of a linear regression, the
variance is not scaled by the square root of the number of samples.
The standard error is calculated directly from the regression.
For an OLS, $Y = X \cdotp \beta +
\varepsilon$, with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, the parameters
are estimated by $b = (X^T \cdotp X)^{-1} (X^T \cdotp Y) = \beta + (X^T \cdotp
X)^{-1} X^T \varepsilon$. The variance of the estimate $b$ is then $Var[b] =
\sigma^2 (X^T \cdotp X)^{-1}$. Since $\sigma^2$ is unknown, we use instead the
sampling variance $s^2 = \frac1{n-p} \sum_{i=1}^n (y_i - \hat{y}_i)^2$, where
$\hat{y}_i$ is the estimate for $y_i$ and $p$ is the number of estimated parameters.
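Here is a minimal sketch of these formulas on synthetic data, using the $n-p$ degrees of freedom discussed above (the setup is made up):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one regressor
y = X @ np.array([1.0, 0.5]) + rng.normal(scale=2.0, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)       # OLS estimate
resid = y - X @ b
p = X.shape[1]
s2 = resid @ resid / (n - p)                # estimate of sigma^2
se = np.sqrt(s2 * np.linalg.inv(X.T @ X).diagonal())
t_stats = b / se                            # null hypothesis: beta_j = 0
p_vals = 2 * stats.t.sf(np.abs(t_stats), df=n - p)
print(t_stats, p_vals)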
Independent two sample t-test
We can use an independent two sample
t-test
to compare the population mean of two populations, when the samples for each
population do not have a clear connection with each other. We’ll look at the
case of the dependent two-sample t-test afterward.
Again, we typically do not
know the variance of each population and must resort to the sampling variance,
which leads to a Student's t distribution. The statistic we use here is
\[ \frac{ \bar{X}_1 - \bar{X}_2}{s_d} \sim t(df) \]
where $s_d$ is the standard error of the difference of the means and $df$ is the number of degrees of
freedom. The correct values to use depend on the situation.
If the two populations have the same variance, we can use a pooled sample
variance $s_p$ to compute $s_d$, i.e.,
\[ s_d = s_p \sqrt{1/n_1 + 1/n_2}, \]
where \(s_p^2 = (\sum_i (n_i-1) s_i^2) / (\sum_j (n_j-1))\), and
\[ df = n_1 + n_2 - 2 \]
If the two populations have different variances (and, in the general case,
different numbers of samples), then
\[ s_d = \sqrt{s_1^2/n_1 + s_2^2/n_2} . \]
The exact distribution in that case is a mess, but for all practical purposes it
can be approximated by a Student's t distribution with degrees of freedom
\[ df = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{(s_1^2/n_1)^2/(n_1-1) +
(s_2^2/n_2)^2/(n_2-1)} \]
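Both cases are available in SciPy through the equal_var flag of ttest_ind; a minimal sketch on synthetic data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x1 = rng.normal(0.0, 1.0, size=40)
x2 = rng.normal(0.5, 2.0, size=60)

print(stats.ttest_ind(x1, x2, equal_var=True))    # pooled variance, df = n1 + n2 - 2
print(stats.ttest_ind(x1, x2, equal_var=False))   # Welch's test, approximate df above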
A dependent t-test for paired samples is when two samples are related to
each other, for instance, if those are the same patients before and after a
treatment. In that case, the solution is to test the difference of the pairs in
a one-sample t-test.
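A quick sketch of that equivalence on synthetic data; SciPy's ttest_rel computes exactly the one-sample test on the differences:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
before = rng.normal(10.0, 2.0, size=30)
after = before + rng.normal(-0.5, 1.0, size=30)   # same patients after treatment

print(stats.ttest_1samp(after - before, popmean=0.0))   # one-sample test on differences
print(stats.ttest_rel(after, before))                    # identical result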
Chi-square tests
There are different flavours of chi-square
tests.
I'm here looking at goodness-of-fit, i.e., testing whether a categorical population
is distributed according to a theoretical distribution (null hypothesis). In
that case, Pearson’s chi-square test is actually an approximation to the
G-test,
\[ G = 2 \sum_i O_i \ln (O_i/E_i) \sim \chi^2(df) \]
where $O_i$ is the observed count for category $i$ and $E_i$ is the expected count
for category $i$ (under the null). The degrees of freedom are $df =$ number of
categories $-$ (number of estimated parameters in the distribution + 1); for instance, the
number of parameters for a uniform distribution is 0, for a normal distribution
(mean and variance estimated) it is 2,$\dots$
How do we get to the chi-square test from a G-test?
We use the expansion $\ln(1+u) \approx u - u^2/2$ when $|u|\ll 1$.
If we assume that $O_i$ and
$E_i$ are close to each other, then
\[ O_i \ln(E_i/O_i) = O_i \ln \left( 1 + \frac{E_i-O_i}{O_i} \right)
\approx (E_i - O_i) - \frac{(E_i-O_i)^2}{2O_i} \]
Going back to the G statistic, and using the fact that $\sum_i E_i = \sum_i
O_i$ (so the first-order terms cancel), we have
\[ G = -2 \sum_i O_i \ln (E_i/O_i)
\approx \sum_i \frac{(O_i-E_i)^2}{O_i} \]
which, after replacing $O_i$ by $E_i \approx O_i$ in the denominator, is Pearson's
chi-square statistic $\sum_i (O_i-E_i)^2/E_i$.
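As a sketch, SciPy exposes both statistics: chisquare is Pearson's test, and power_divergence with lambda_="log-likelihood" is the G-test (the counts below are made up):

import numpy as np
from scipy import stats

observed = np.array([48, 35, 15, 2])
expected = np.array([50, 30, 15, 5])   # counts under the null

print(stats.chisquare(observed, f_exp=expected))          # Pearson's chi-square
print(stats.power_divergence(observed, f_exp=expected,
                             lambda_="log-likelihood"))   # G-test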
ANOVA test
I won't go into much detail. The ANOVA
test can be seen as a
generalization of the two-sample t-test to the case of multiple populations.
There is also a whole chapter dedicated to it in Casella & Berger's Statistical
Inference textbook.
Ljung-Box test
The Ljung-Box test is a portmanteau
test for the autocorrelation of a time-series. It tests whether
the first $h$ autocorrelations of a time-series are consistent with white noise (null) or
exhibit serial correlation (alternative). We define the statistic
\[ Q = n (n+2) \sum_{k=1}^h \frac{\hat{\rho}_k^2}{n-k} \sim \chi^2(h) \]
where $\hat{\rho}_k$ is the lag-k sample autocorrelation. Note that this is a
one-sided test.
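A minimal sketch using acorr_ljungbox from statsmodels on synthetic data (the exact return format varies across statsmodels versions):

import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(4)
noise = rng.normal(size=500)                  # white noise
ar1 = np.zeros(500)                           # serially correlated AR(1) series
for t in range(1, 500):
    ar1[t] = 0.6 * ar1[t - 1] + rng.normal()

print(acorr_ljungbox(noise, lags=[10]))       # large p-value: consistent with the null
print(acorr_ljungbox(ar1, lags=[10]))         # tiny p-value: serial correlation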
Unit root test
If a time-series has a unit root, it is, among other things, non-stationary.
For instance, we can test for the presence of a unit root to decide whether a time-series
needs to be differenced or not.
The Dickey-Fuller test is a popular test for unit roots.
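As a sketch, the augmented variant of the test is available in statsmodels; a random walk has a unit root by construction, so it makes a handy synthetic example:

import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(5)
walk = np.cumsum(rng.normal(size=500))   # a random walk has a unit root

stat, pvalue, *_ = adfuller(walk)
print(stat, pvalue)                      # large p-value: cannot reject the unit root
stat, pvalue, *_ = adfuller(np.diff(walk))
print(stat, pvalue)                      # differencing removes the unit root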
12 Nov 2018
Notes
In optimization, you need to take derivatives. To do that in $\mathbb{R}^n$, you
need matrix calculus. The objective of this note is to summarize the important
definitions and conventions, explain them whenever possible, and show a few
examples.
Frechet and Gateaux derivatives
Let’s define a function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$. That
function is
Frechet differentiable
at $x$ if
there exists a bounded (necessarily the case for an operator between finite dimensional
spaces) linear operator
$Df(x): \mathbb{R}^n \rightarrow \mathbb{R}^m$
such that
\[ \frac{| f(x+h) - f(x) - Df(x)h |}{| h |} \rightarrow 0 \]
as $|h| \rightarrow 0$.
A somehow weaker definition of differentiability is the Gateaux
differentiability. A
function $f$ is Gateaux differentiable at $x$ if, for any $v
\in \mathbb{R}^n$ the following limit exists
\[ \lim_{\varepsilon \rightarrow 0} \frac{f(x + \varepsilon v) - f(x)}{\varepsilon} . \]
If that limit exists, we can calculate it as
\[ \lim_{\varepsilon \rightarrow 0} \frac{f(x + \varepsilon v) -
f(x)}{\varepsilon} = \frac{d}{d\varepsilon} f(x + \varepsilon v)
\Big|_{\varepsilon=0} . \]
If $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$
is Frechet differentiable, it is necessarily Gateaux differentiable, and
\(\frac{d}{d \varepsilon} f(x + \varepsilon v) |_{\varepsilon=0} = Df(x) v\).
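A quick numerical sanity check of this relation, using a hypothetical central-difference helper to estimate the Gateaux derivative:

import numpy as np

def gateaux(f, x, v, eps=1e-6):
    # Central-difference estimate of d/d(eps) f(x + eps * v) at eps = 0.
    return (f(x + eps * v) - f(x - eps * v)) / (2 * eps)

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
f = lambda x: A @ x                      # Frechet derivative: Df(x) v = A v
x, v = np.array([1.0, -1.0]), np.array([0.5, 2.0])
print(gateaux(f, x, v))                  # matches...
print(A @ v)                             # ...the exact directional derivative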
Gradient of a functional
Let’s call $H = \mathbb{R}^n$, which is a Hilbert space with inner product
$\langle \cdotp, \cdotp \rangle$. For a functional $f: \mathbb{R}^n \rightarrow
\mathbb{R}$, the derivative $Df(x)$ is, by definition, an element of the dual
space $H^*$.
Applying Riesz representation
theorem, we know
there is an element $g_x \in H$ such that for any $v \in H$,
\(Df(x) v = \langle g_x, v \rangle\). That element $g_x$ is the gradient of
the functional $f$. This clearly defines the gradient of a functional,
without having to agree on notations or conventions.
General case
What can we do for a function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$?
First, we can't apply the Riesz representation theorem. Also, it is not clear how we
would optimize such a function $f$: we'd need to define a total
order
on $\mathbb{R}^m$ that coincides with the objective of the optimization. For
that reason, I see the definition of a gradient in that case as more of a
convention. There are really two conventions, which are a transpose of each
other (see layout
conventions),
and I adopt the convention used in Nocedal & Wright’s Numerical Optimization
textbook (section A.2 Derivatives).
Linear maps between two finite-dimensional spaces can
all be described by the action of a matrix.
Nocedal & Wright define the Jacobian as the matrix
$J(x) \in \mathbb{R}^{m \times n}$
that verifies, for any $v \in \mathbb{R}^n$, $Df(x)v = J(x) \cdotp v$.
The gradient is defined to be the transpose,
\[\begin{align}
J(x) & = \left[ \frac{\partial f_i}{\partial x_j} \right]_{ij}
\in \mathbb{R}^{m \times n} \\
\nabla f(x) & = \left[ \frac{\partial f_j}{\partial x_i} \right]_{ij} = J(x)^T
= \begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \dots & \frac{\partial f_m}{\partial x_1} \\
\vdots & & \vdots \\
\frac{\partial f_1}{\partial x_n} & \dots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix}
\in \mathbb{R}^{n \times m}
\end{align}\]
One thing to be careful about with this notation:
the chain-rule, as we know it, applies to the Jacobian, i.e., if $h(x) =
f(g(x))$, then
\[J_h(x) = J_f(g(x)) \cdotp J_g(x)\]
and therefore in terms of the gradient, we get the transpose,
\[\nabla h(x) = \nabla g(x) \cdotp \nabla f(g(x))\]
For instance, let’s assume $f: \mathbb{R}^m \rightarrow \mathbb{R}$ and $g: \mathbb{R}^n
\rightarrow \mathbb{R}^m$. Then, with $y_k = (g(x))_k$, for any $i=1,\dots,n$,
\[\frac{\partial h}{\partial x_i} = \sum_{k=1}^m \frac{\partial f}{\partial
y_k} \frac{\partial y_k}{\partial x_i}
= \left [\frac{\partial (g(x))_k}{\partial x_i} \right]_k^T \cdotp \nabla f(g(x))\]
Then putting all indices $i$ together (in rows), we get the expression above for the
gradient.
Derivative with respect to a matrix
In that case also, this is just a convenient notation. For a function $f :
\mathbb{R}^{m \times n} \rightarrow \mathbb{R}$, we define
\[\frac{\partial f(M)}{\partial M} = \left[
\frac{\partial f(M)}{\partial m_{ij}} \right]_{ij}\]
Examples
If $f(x) = Ax + b$
We then have $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$.
We can apply the definition of the Gateaux derivative,
\[f(x + \varepsilon v) = f(x) + \varepsilon A v\]
We can conclude directly that
\[\nabla f(x) = A^T\]
If $f(x) = \frac12 x^T Q x $
We then have $f : \mathbb{R}^n \rightarrow \mathbb{R}$.
We can apply the definition of the Gateaux derivative,
\[f(x + \varepsilon v) = f(x) + \frac12 \varepsilon
\left( x^T Q v + v^T Q x \right) +
\frac12 \varepsilon^2 v^T Q v\]
We conclude that
\[\nabla f(x) = \frac12 ( Q + Q^T) x\]
In the special case that $Q=Q^T$ (symmetric), we have
\[\nabla f(x) = Q x\]
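A quick finite-difference check of this formula, with made-up numbers and a deliberately non-symmetric $Q$:

import numpy as np

Q = np.array([[2.0, 1.0], [0.0, 3.0]])   # deliberately non-symmetric
f = lambda x: 0.5 * x @ Q @ x
x = np.array([1.0, -2.0])

eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(2)])
print(fd)                                # finite differences...
print(0.5 * (Q + Q.T) @ x)               # ...match the formula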
If $f(x) = \frac12 \| Ax+b \|^2$
We then have $f : \mathbb{R}^n \rightarrow \mathbb{R}$.
Following the same approach, we get
\[f(x + \varepsilon v) = f(x) +
\frac12 \varepsilon \left( (Ax+b)^T A v + (Av)^T(Ax+b) \right) +
\frac12 \varepsilon^2 \|Av\|^2\]
Since \(x^T \cdotp v = v^T \cdotp x\),
we can conclude that
\[\nabla f(x) = A^T \cdotp (Ax+b)\]
Alternatively, we can use the chain-rule with $g(y) = \frac12 | y |^2$ and
$f(x) = g(Ax + b)$.
Since $\nabla (Ax+b) = A^T$ and $\nabla g(y) = y$, we have
\[\nabla f(x) = A^T \cdotp (Ax+b)\]
If $f(A) = Ax+b$
In that case, I’m not sure it helps to talk about a gradient. However we can
still calculate the derivative (e.g., using the formula for the Gateaux
derivative), and we get
\[Df(A) M = M \cdotp x\]
If $f(A) = \frac12 \| Ax+b \|^2$
Here we have $f: \mathbb{R}^{m \times n} \rightarrow \mathbb{R}$.
It’s tempting to use the chain-rule, but I couldn’t agree with myself on a
logical convention. And I think sadly this is the main conclusion of that small
post: it's safer to always rely on the entry-wise derivative. In this case,
with $y = Ax + b \in \mathbb{R}^m$ and $g(y) = \frac12 \| y \|^2$ as before, we have
\[\begin{align*}
\frac{\partial f(A)}{\partial a_{ij}} & =
\sum_k \frac{\partial g(Ax + b)}{\partial y_k} \frac{\partial y_k}{\partial
a_{ij}} \\
& = \left[ \frac{\partial y_1}{\partial a_{ij}}, \dots, \frac{\partial y_m}{\partial
a_{ij}} \right] \cdotp (Ax + b)
\end{align*}\]
Let’s look at the partial derivatives for $y$, using the notation $\delta_{ik} =
1$ if $i=k$ and $0$ otherwise,
\[\frac{\partial y_k}{\partial a_{ij}} = x_j \delta_{ik}\]
Such that
\[\frac{\partial f(A)}{\partial a_{ij}} = (Ax+b)_i x_j\]
And putting all indices back together,
\[\frac{\partial f(A)}{\partial A} = (Ax+b) \cdotp x^T\]
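A quick entry-wise finite-difference check of this formula, with made-up numbers:

import numpy as np

A = np.array([[1.0, -2.0], [0.5, 3.0]])
x = np.array([2.0, -1.0])
b = np.array([0.5, 1.5])
f = lambda A: 0.5 * np.sum((A @ x + b) ** 2)

eps = 1e-6
fd = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        E = np.zeros_like(A)
        E[i, j] = eps
        fd[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
print(fd)                         # entry-wise finite differences...
print(np.outer(A @ x + b, x))     # ...match (Ax + b) x^T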
04 Sep 2018
Font Awesome is an easy way to include cool icons in your website, like the ones
for GitHub, LinkedIn, or just email (see side bar). To use it, all you need to do
is add
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css">
to the header of your html file. Then choose
the icon you want to use. You can modify the size of the icons by adding the
appropriate option.
For instance, for a large GitHub icon, you would add
<i class="fa fa-github fa-lg"></i>
01 Sep 2018
This is a very simple note about a proof that I regularly forget. It is often
said that a great advantage of market-cap weighted portfolios is that they are
self-rebalancing. That is, when you choose the weights of each security in a
portfolio according to the proportion of their market-cap, the weights remain
consistent with the original definition (under some rather mild assumptions).
Return of a portfolio
First, an even simpler result that we need in the rest of that note: The return
of a portfolio is simply the weighted average of the returns of the securities
in the portfolio. Let’s introduce some notation: let’s call $X_t$ the
\$-value of
the portfolio at time $t$, and $X_t = \sum_i X_{i,t}$ where $X_{i,t} = w_{i,t}
X_t$ is the \$-value of the $i^\text{th}$ security in the portfolio at time
$t$.
The sum is taken over all securities in the portfolio, and we assume the weights
sum to 1.
Next, the return of the portfolio (and similarly for any security $i$) is defined by $r_t =
X_t/X_{t-1}-1$.
Then,
\[\begin{align}
X_t & = \sum_i X_{i,t} = \sum_i (1+r_{i,t}) X_{i,t-1} = \sum_i (1+r_{i,t})
w_{i,t-1} X_{t-1} \notag \\
& = X_{t-1} \left( 1 + \sum_i w_{i,t-1} r_{i,t} \right),
\end{align}\]
since by definition $\sum_i w_{i,t} = 1$ at any time $t$.
Note that the same conclusion applies to all types of returns, including
continuously compounded returns
\(\bar{r}_{i,t}\), since $1+r_{i,t} = e^{\bar{r}_{i,t}}$.
Self-rebalancing
Let’s define $M_{i,t}$ as the market-cap of security $i$, and let $M_t =
\sum_i M_{i,t}$.
Now let's assume that in period $t-1$ the weights are calculated as a
proportion of their relative market-cap, i.e., $w_{i,t-1} = M_{i,t-1} /
M_{t-1}$. And let's assume that from period $t-1$ to $t$, the market-cap of each
security $i$ varies only through its price (the number of shares remains constant, no
M$\&$A,…).
Then, the ratios of market-cap will vary from $t-1$ to $t$
as
\[ \frac{M_{i,t}/M_t}{M_{i,t-1}/M_{t-1}} = \frac{M_{i,t}}{M_{i,t-1}} \frac{M_{t-1}}{M_t}
= \frac{1+r_{i,t}}{1+r_t} . \]
On the other hand, the weights of the portfolio will vary as
\[ \frac{w_{i,t}}{w_{i,t-1}} = \frac{X_{i,t}/X_t}{X_{i,t-1}/X_{t-1}}
= \frac{X_{i,t}}{X_{i,t-1}} \frac{X_{t-1}}{X_t}
= \frac{1+r_{i,t}}{1+r_t} . \]
The two ratios are identical: without any trading, the weights remain proportional
to the market-caps, i.e., the portfolio is self-rebalancing.
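A tiny numerical sketch of this identity, with made-up market caps and returns:

import numpy as np

M_prev = np.array([300.0, 150.0, 50.0])   # market caps at t-1 (made up)
r = np.array([0.10, -0.05, 0.20])         # price returns over the period

w_prev = M_prev / M_prev.sum()            # market-cap weights at t-1
X_i = w_prev * 1000.0 * (1 + r)           # positions of a $1000 portfolio at t
w_new = X_i / X_i.sum()                   # realized weights at t, no trading

M_new = M_prev * (1 + r)                  # caps also move only through prices
print(w_new)
print(M_new / M_new.sum())                # identical: self-rebalancing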