11 May 2018
Let $X_1,\ldots,X_n$ be a random sample from a population with mean $\mu$ and
variance $\sigma^2 < \infty$. Any function of this
random sample is called a statistic, and the probability distribution of a statistic is
called the sampling distribution of that statistic.
In this note, we look at two important statistics.
Sample mean
The sample mean is defined as \[ \bar{X} = \frac1n \sum_{i=1}^n X_i \]
The sample mean is an unbiased estimator of the mean of the distribution, i.e.,
\[ \mathbb{E}[\bar{X}] = \frac1n \sum_{i=1}^n \mathbb{E}[X_i] = \mu\]
And the variance of the sample mean is
\[\begin{align}
Var[\bar{X}] & = \mathbb{E}\left[\left( \frac1n \sum_{i=1}^n (X_i-\mu) \right)^2 \right]
\notag \\
& = \frac{1}{n^2} \left( \sum_{i=1}^n \mathbb{E}[ (X_i-\mu)^2] + \sum_{i \neq j}
\mathbb{E}[(X_i-\mu)(X_j-\mu)] \right) \notag \\
& = \frac{\sigma^2}{n}
\end{align}\]
where the cross terms vanish because the $X_i$ are independent, so
$\mathbb{E}[(X_i-\mu)(X_j-\mu)] = \mathbb{E}[X_i-\mu]\,\mathbb{E}[X_j-\mu] = 0$ for $i \neq j$.
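A quick simulation makes these two identities concrete. This is a minimal sketch with numpy; the normal population and the parameters are arbitrary choices, any distribution with finite variance would do.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 3.0, 50, 100_000

# Draw many samples of size n and compute the sample mean of each.
samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)

print(xbar.mean())  # close to mu = 2.0
print(xbar.var())   # close to sigma^2 / n = 9 / 50 = 0.18
```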
Sample variance
The sample variance, written $S^2$ (or $S_n^2$ when we want to make the sample size
explicit), is defined as \[ S^2 = \frac1{n-1} \sum_{i=1}^n (X_i -
\bar{X})^2 \]
The sample variance is an unbiased estimator of the variance of the distribution
the samples are taken from, i.e.,
\[\begin{align}
\mathbb{E}[S^2] & = \frac1{n-1} \sum_{i=1}^n \left( \mathbb{E}[X_i^2] - \frac2n
\sum_{j=1}^n \mathbb{E}[X_i X_j] + \mathbb{E}[\bar{X}^2] \right) \notag\\
& = \frac1{n-1} \sum_{i=1}^n \left( \mu^2 + \sigma^2 - \frac2n (\sigma^2 + n
\mu^2 ) + \frac1{n^2} (n(\mu^2+\sigma^2) + (n^2-n)\mu^2) \right) \notag\\
& = \frac1{n-1} \sum_{i=1}^n \left( \sigma^2 - \frac{\sigma^2}{n} \right) \notag\\
& = \sigma^2
\end{align}\]
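In numpy, this is the difference between `ddof=1` (the unbiased estimator above, divide by $n-1$) and `ddof=0` (divide by $n$). A small simulation, with arbitrarily chosen parameters, shows the biased version systematically underestimating $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, n, reps = 4.0, 10, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2_unbiased = samples.var(axis=1, ddof=1)  # divide by n-1
s2_biased = samples.var(axis=1, ddof=0)    # divide by n

print(s2_unbiased.mean())  # close to sigma^2 = 4.0
print(s2_biased.mean())    # close to (n-1)/n * sigma^2 = 3.6
```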
However, because of Jensen’s inequality,
\(\sqrt{S^2}\) is a biased estimator of the standard deviation of the distribution
we sampled from. Indeed, since the square root is strictly concave over its domain, we
have \(\mathbb{E}[\sqrt{S^2}] < \sqrt{\mathbb{E}[S^2]} = \sigma\) (unless
$\sigma=0$).
On the other hand, because the square root is a continuous function over
$\mathbb{R}^+$, if $S^2$ is a consistent estimator (converges in probability),
then $\sqrt{S^2}$ is also a consistent estimator (i.e., the bias disappears as the
number of samples grows). Using Chebyshev’s inequality, we see that
\[ \mathbb{P}[|S_n^2 - \sigma^2| \geq \varepsilon] \leq
\frac{Var[S_n^2]}{\varepsilon^2}, \]
that is, $S^2_n$ is consistent if $Var[S_n^2]$ goes to zero as the number of
samples increases.
One can show (we omit the proof) that, with $\theta = \mathbb{E}[(X-\mu)^4]$ the
fourth central moment of the underlying distribution,
\[ Var[S^2] = \frac1n \left( \theta - \frac{n-3}{n-1} \sigma^4 \right) \]
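This formula is easy to check by simulation. The sketch below uses an exponential population only because its fourth central moment differs markedly from the normal case; the parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 20, 400_000

# Exponential(1) population: mu = 1, sigma^2 = 1, fourth central moment theta = 9.
samples = rng.exponential(1.0, size=(reps, n))
s2 = samples.var(axis=1, ddof=1)

theta, sigma2 = 9.0, 1.0
print(s2.var())                                     # empirical Var[S^2]
print((theta - (n - 3) / (n - 1) * sigma2**2) / n)  # formula: ~0.405
```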
Hypothesis testing
A common test for a random sample is to test the value of its mean.
To do so, we can use the statistic
\[ \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \]
where $\mu$ is the hypothesized value of the mean (the value under the null hypothesis).
For hypothesis testing, we need to know the distribution of that statistic. In
the general case, we can only conclude asymptotically, as the central limit
theorem states that if the
underlying distribution has finite variance,
\[ \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \rightarrow \mathcal{N}(0,1) \text{ (distribution)} \]
In practice, $\sigma$ is rarely known, and we want to replace it with $S$. We
know that $S$ is a consistent estimator of $\sigma$, and we can show that $\sigma /
S \rightarrow 1$ (probability).
Then by Slutsky’s theorem,
\[ \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \frac{\sigma}{S} \rightarrow \mathcal{N}(0,1) \text{ (distribution)} \]
However this result is only valid asymptotically. Often in practice, people
assume the asymptotic regime is valid, and use a standard normal distribution
for that statistic; but there is no general rule to know when that approximation
is valid.
On the other hand, if the random samples come from a normal distribution
$\mathcal{N}(\mu,\sigma^2)$, we can say something stronger, valid for any $n$. In that case,
\[ \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim \mathcal{N}(0,1) \]
and
\[ \frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t_{n-1} \]
a Student’s t-distribution with $n-1$ degrees of freedom.
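As an illustration, here is a one-sample t-test in Python; a sketch where the data and the hypothesized mean are made up. The manual computation of the statistic matches `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(5.2, 2.0, size=25)  # made-up sample
mu0 = 5.0                          # hypothesized mean

# Manual t statistic: (xbar - mu0) / (S / sqrt(n)), compared against t_{n-1}.
n = len(x)
t_stat = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

print(t_stat, p_value)
print(stats.ttest_1samp(x, mu0))  # same t statistic and p-value
```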
04 May 2018
Delta method
The Delta method can be used to approximate the variance of a function of a random
variable that converges to a normal random variable.
Assuming \(\sqrt{n} (Y_n - \theta) \rightarrow
\mathcal{N}(0,\sigma^2)\) (distribution) with $\theta \in
\mathbb{R}$, then for a given function $g$ (with $g'(\theta) \neq 0$), we have
\[ \sqrt{n} (g(Y_n) - g(\theta)) \rightarrow \mathcal{N}(0, \sigma^2
[g'(\theta)]^2) \text{ (distribution)} \]
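A small simulation illustrates the statement, with $Y_n$ the sample mean of i.i.d. draws and $g(x) = x^2$. This is only a sketch; the exponential population and the parameters are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 200, 100_000

# Y_n is the sample mean of n Exponential(1) draws: theta = 1, sigma^2 = 1.
y_n = rng.exponential(1.0, size=(reps, n)).mean(axis=1)

# Delta method with g(x) = x^2: sqrt(n) (g(Y_n) - g(theta)) should be
# approximately N(0, sigma^2 * [g'(theta)]^2) = N(0, 4).
z = np.sqrt(n) * (y_n**2 - 1.0**2)

print(z.mean())  # close to 0 (a small finite-n bias remains)
print(z.var())   # close to 4
```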
Convergence in distribution and convergence in probability
In the proof of the Delta method, we use the result that
\[ \sqrt{n} (Y_n - \theta) \rightarrow
\mathcal{N}(0,\sigma^2) \text{ (distribution)} \Rightarrow
Y_n \rightarrow \theta \text{ (probability),} \]
i.e., for any $\varepsilon > 0$,
$\mathbb{P}[|Y_n-\theta| \geq \varepsilon] \rightarrow 0$ as $n$ grows.
Now, let us introduce $X_n = \sqrt{n}(Y_n - \theta)$. Then
\[ Y_n - \theta = \frac{1}{\sqrt{n}} X_n \]
$X_n$ converges to $\mathcal{N}(0,\sigma^2)$ in distribution and $1/\sqrt{n}$
converges to zero.
Slutsky’s theorem tells us that
\[ X_n \rightarrow X \text{ (distr) and } Z_n \rightarrow a \in \mathbb{R}
\text{ (prob)}
\Rightarrow X_n Z_n \rightarrow aX \text{ (distr)} \]
Here, $Z_n = 1/\sqrt{n}$ is a deterministic sequence converging to $a = 0$, so
$Y_n - \theta = Z_n X_n \rightarrow 0$ (distr).
Moreover,
\[ X_n \rightarrow a \in \mathbb{R} \text{ (distr)}
\Rightarrow X_n \rightarrow a \text{ (prob)} \]
Indeed, $\mathbb{P}[|X_n-a| \geq \varepsilon] \leq 1-F_n(a+\varepsilon/2) +
F_n(a-\varepsilon)$. Convergence to a constant $a$ in distribution means $F_n(x)$
converges to the Heaviside function $H(x-a)$ at every continuity point, i.e., for
all $x \neq a$. Therefore $F_n(a+\varepsilon/2)
\rightarrow 1$ and $F_n(a-\varepsilon) \rightarrow 0$.
Going back to our original proof, we have $Y_n - \theta \rightarrow 0$ (prob),
or equivalently $Y_n \rightarrow \theta$ (prob).
02 May 2018
I want to illustrate how correlation and covariance relate to the standard
deviation of the random variables they involve.
Let’s introduce two random variables $X_i$, $i=1,2$, and the centered random
variables $Y_i = X_i - \mathbb{E}[X_i]$. Then we have
\(\mathbb{E}[Y_i] = 0\), \(Var[Y_i] = Var[X_i]\), and \(Cov(Y_1, Y_2) =
Cov(X_1,X_2)\).
We then introduce two new random variables, $\tilde{X}_i$, with the same
expectation as $X_i$, but with a re-scaled standard deviation,
\[\begin{align}
\tilde{X}_i & = \mathbb{E}[X_i] + a_i Y_i \notag \\
& = \mathbb{E}[X_i] + a_i (X_i - \mathbb{E}[X_i])
\end{align}\]
with $a_i \in \mathbb{R}$. We can immediately verify that
$\mathbb{E}[\tilde{X}_i] = \mathbb{E}[X_i]$ and $Var[\tilde{X}_i] = a_i^2
Var[X_i]$.
Then, the covariance of $\tilde{X}_1$ and $\tilde{X}_2$ gives
\[\begin{align}
Cov(\tilde{X}_1,\tilde{X}_2) & =
\mathbb{E}[(\tilde{X}_1-\mathbb{E}[\tilde{X}_1])(\tilde{X}_2-\mathbb{E}[\tilde{X}_2])]
\notag
\\
& = a_1a_2 \mathbb{E}[Y_1 Y_2] \notag \\
& = a_1a_2 Cov(X_1,X_2)
\end{align}\]
The correlation between $\tilde{X}_1$ and $\tilde{X}_2$, on the other hand, is given by
\[\begin{align}
corr(\tilde{X}_1,\tilde{X}_2) & =
\frac{Cov(\tilde{X}_1,\tilde{X}_2)}{\sqrt{Var[\tilde{X}_1] Var[\tilde{X}_2]}}
\notag \\
& = \frac{Cov({X}_1,{X}_2)}{\sqrt{Var[{X}_1] Var[{X}_2]}} \notag \\
& =
corr({X}_1,{X}_2)
\end{align}\]
In words, the covariance scales with the standard deviation of each random
variable, whereas the correlation is oblivious to the standard deviations.
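A quick numerical check of this with numpy; the data and the scaling factors $a_1$, $a_2$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)

# Two correlated variables X1, X2.
x1 = rng.normal(size=10_000)
x2 = 0.6 * x1 + 0.8 * rng.normal(size=10_000)

# Rescale the centered variables by a1 and a2 while keeping the means.
a1, a2 = 3.0, 0.5
x1_t = x1.mean() + a1 * (x1 - x1.mean())
x2_t = x2.mean() + a2 * (x2 - x2.mean())

cov = np.cov(x1, x2)[0, 1]
cov_t = np.cov(x1_t, x2_t)[0, 1]
print(cov_t / cov)                    # close to a1 * a2 = 1.5
print(np.corrcoef(x1, x2)[0, 1])      # unchanged ...
print(np.corrcoef(x1_t, x2_t)[0, 1])  # ... by the rescaling
```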
15 Feb 2018
I downloaded the Netflix dataset from
Kaggle. The Netflix
dataset is a list of movie ratings entered by different customers. It was
provided by Netflix for the Netflix
Prize competition.
The goal of the competition was to predict missing movie ratings.
So far, I have been
playing with
the data to format them in an interesting way. My approach is to
- create a movieID column by copying custID and removing all entries not
ending with ‘:’, then forward-fill (‘ffill’) movieID,
- create a sparse matrix with the movie ratings for each customer,
- create a SparseDataFrame from the sparse matrix.
A pandas sketch of the first step is shown below.
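Here is roughly what the parsing looks like. I assume the Kaggle file layout, where a line like "123:" introduces movie 123 and the lines that follow are custID,rating,date entries; the file name and column names here are my own.

```python
import pandas as pd

df = pd.read_csv("combined_data_1.txt", header=None,
                 names=["custID", "rating", "date"], dtype={"custID": str})

# Rows whose custID ends with ':' are movie header lines, e.g. "123:".
is_movie_row = df["custID"].str.endswith(":")

# Copy custID into movieID, keep it only on the header rows, then forward-fill.
df["movieID"] = df["custID"].where(is_movie_row).str.rstrip(":").ffill()

# Keep only the actual rating rows.
df = df[~is_movie_row].drop(columns="date")
df["custID"] = df["custID"].astype(int)
df["movieID"] = df["movieID"].astype(int)
```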
An interesting question is how to reconcile ratings from different
customers. One approach is to normalize the ratings for each customer, by
subtracting the mean rating and dividing by the standard deviation of the
ratings for that customer.
Also, as I don’t want to have to deal with pathological cases, I am going to
remove all customers with a single rating. A sketch of this normalization is shown below.
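A sketch of the per-customer normalization in pandas, assuming the ratings DataFrame built above with columns custID, movieID, rating (the column names are my own).

```python
# Ratings as floats (header rows already dropped above).
df["rating"] = df["rating"].astype(float)

# Drop customers with a single rating (their standard deviation is undefined).
counts = df.groupby("custID")["rating"].transform("count")
df = df[counts > 1].copy()

# Subtract each customer's mean rating and divide by their standard deviation.
grp = df.groupby("custID")["rating"]
df["rating_norm"] = (df["rating"] - grp.transform("mean")) / grp.transform("std")
```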
Unfortunately, the SparseDataFrame object appears to be extremely slow for any
sort of operation (subtract, mean, …). I ended up performing all the operations prior to
the conversion to a sparse dataframe.
04 Feb 2018
General
The goal of supervised learning is to use a set of measures, called inputs, to
predict some outputs. Inputs and outputs are sometimes also called
| inputs                | outputs             |
| --------------------- | ------------------- |
| predictors            | responses           |
| independent variables | dependent variables |
| features              |                     |
Categories of outputs
Outputs can be of two types,
- quantitative: outputs are numbers, with an ordering. Prediction with
quantitative outputs is called regression.
- qualitative/categorical/discrete: outputs take values in a discrete set, not
necessarily with an order relationship. Prediction with qualitative outputs is called
classification.
Notations
- Input variables are denoted by $X=\{X_j\}_j$ (vector).
- Outputs are denoted by $Y$ (quantitative) or $G$ (qualitative).
- Predicted or estimated quantities are marked with a hat.
- $X,Y,G$ are generic variables. The actual observed values are denoted in
lowercase; the $i^\text{th}$ observed input value is $x_i$ (potentially a
vector).