Sample mean, sample variance

Let $X_1,\ldots,X_n$ be a random sample from a population with mean $\mu$ and variance $\sigma^2 < \infty$. Any function of this random sample is called a statistic, and the probability distribution of a statistic is called the sampling distribution of that statistic. In this note, we look at two important statistics.

Sample mean

The sample mean is defined as \[ \bar{X} = \frac1n \sum_{i=1}^n X_i \] The sample mean is an unbiased estimator of the mean of the distribution, i.e., \[ \mathbb{E}[\bar{X}] = \frac1n \sum_{i=1}^n \mathbb{E}[X_i] = \mu \] The variance of the sample mean is \[\begin{align} Var[\bar{X}] & = \mathbb{E}\left[\left( \frac1n \sum_{i=1}^n (X_i-\mu) \right)^2 \right] \notag \\ & = \frac{1}{n^2} \left( \sum_{i=1}^n \mathbb{E}[ (X_i-\mu)^2] + \sum_{i \neq j} \mathbb{E}[(X_i-\mu)(X_j-\mu)] \right) \notag \\ & = \frac{\sigma^2}{n} \end{align}\] where the cross terms vanish because the $X_i$ are independent, so that $\mathbb{E}[(X_i-\mu)(X_j-\mu)] = \mathbb{E}[X_i-\mu]\,\mathbb{E}[X_j-\mu] = 0$ for $i \neq j$.
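As a quick numerical check (a minimal sketch assuming NumPy; the normal population and the values of $\mu$, $\sigma$, $n$ are arbitrary choices), the empirical mean and variance of $\bar{X}$ over many repetitions match $\mu$ and $\sigma^2/n$:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n, reps = 2.0, 3.0, 50, 100_000

    # reps independent samples of size n from a N(mu, sigma^2) population
    x = rng.normal(loc=mu, scale=sigma, size=(reps, n))
    xbar = x.mean(axis=1)                 # one sample mean per repetition

    print(xbar.mean())                    # close to mu (unbiasedness)
    print(xbar.var(), sigma**2 / n)       # close to sigma^2 / n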

Sample variance

The sample variance is defined as \[ S^2 = \frac1{n-1} \sum_{i=1}^n (X_i - \bar{X})^2 \] (also written $S_n^2$ when the sample size matters). The sample variance is an unbiased estimator of the variance of the distribution the samples are taken from. Expanding the square and using $\mathbb{E}[X_i \bar{X}] = \frac1n \sum_{j=1}^n \mathbb{E}[X_i X_j]$, \[\begin{align} \mathbb{E}[S^2] & = \frac1{n-1} \sum_{i=1}^n \mathbb{E}\left[ (X_i - \bar{X})^2 \right] \notag\\ & = \frac1{n-1} \sum_{i=1}^n \left( \mathbb{E}[X_i^2] - \frac2n \sum_{j=1}^n \mathbb{E}[X_i X_j] + \mathbb{E}[\bar{X}^2] \right) \notag\\ & = \frac1{n-1} \sum_{i=1}^n \left( \mu^2 + \sigma^2 - \frac2n (\sigma^2 + n \mu^2 ) + \frac1{n^2} (n(\mu^2+\sigma^2) + (n^2-n)\mu^2) \right) \notag\\ & = \sigma^2 \end{align}\] However, because of Jensen’s inequality, \(\sqrt{S^2}\) is a biased estimator of the standard deviation of the distribution we sampled from. Indeed, since the square root is strictly concave on its domain, \(\mathbb{E}[\sqrt{S^2}] < \sqrt{\mathbb{E}[S^2]} = \sigma\) (unless $\sigma=0$).
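The same kind of simulation (again a sketch assuming NumPy, with arbitrary parameters) shows both facts at once: $S^2$ is unbiased for $\sigma^2$, while $\sqrt{S^2}$ systematically underestimates $\sigma$:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n, reps = 0.0, 2.0, 10, 200_000

    x = rng.normal(loc=mu, scale=sigma, size=(reps, n))
    s2 = x.var(axis=1, ddof=1)            # sample variance, 1/(n-1) normalization

    print(s2.mean(), sigma**2)            # E[S^2] ~ sigma^2 (unbiased)
    print(np.sqrt(s2).mean(), sigma)      # E[sqrt(S^2)] < sigma (Jensen bias)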

On the other hand, because the square root is a continuous function over $\mathbb{R}^+$, if $S^2$ is a consistent estimator (converges in probability to $\sigma^2$), then $\sqrt{S^2}$ is also a consistent estimator of $\sigma$ (i.e., the bias disappears as the number of samples grows). Using Chebyshev’s inequality, we see that \[ \mathbb{P}[|S_n^2 - \sigma^2| \geq \varepsilon] \leq \frac{Var[S_n^2]}{\varepsilon^2}, \notag \] that is, $S^2_n$ is consistent if $Var[S_n^2]$ goes to zero as the number of samples increases.

Without proof, one can show that, with $\theta$ the fourth central moment of the underlying distribution, we have \[ Var[S^2] = \frac1n \left( \theta - \frac{n-3}{n-1} \sigma^4 \right) \] which goes to zero as $n$ grows, so $S_n^2$ (and hence $\sqrt{S_n^2}$) is indeed consistent.
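As a sanity check (worked out here, not a quoted result): for a normal population the fourth central moment is $\theta = 3\sigma^4$, and the formula reduces to \[ Var[S^2] = \frac1n \left( 3\sigma^4 - \frac{n-3}{n-1} \sigma^4 \right) = \frac{\sigma^4}{n} \, \frac{3(n-1) - (n-3)}{n-1} = \frac{2\sigma^4}{n-1}, \] the classical result for normal samples, which also follows from $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$.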

Hypothesis testing

A common test on a random sample is a test of the value of its mean. To do so, we can use the statistic \[ \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \] where $\mu$ is the value of the mean we test for (the null hypothesis). For hypothesis testing, we need to know the distribution of that statistic. In the general case, we can only conclude asymptotically: the central limit theorem states that if the underlying distribution has finite variance, \[ \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \rightarrow \mathcal{N}(0,1) \text{ (distribution)} \] In practice, $\sigma$ is rarely known, and we want to replace it with $S$. We know that $S$ is a consistent estimator of $\sigma$, and by the continuous mapping theorem $\sigma / S \rightarrow 1$ (probability). Then by Slutsky’s theorem, \[ \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \frac{\sigma}{S} = \frac{\bar{X} - \mu}{S / \sqrt{n}} \rightarrow \mathcal{N}(0,1) \text{ (distribution)} \]

However, this result only holds asymptotically. In practice, people often assume the asymptotic regime applies and use a standard normal distribution for that statistic, but there is no general rule for deciding when that approximation is valid.

On the other hand, if the random samples come from a normal distribution $\mathcal{N}(\mu,\sigma^2)$, we can say something stronger, valid for any sample size. In that case, \[ \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim \mathcal{N}(0,1) \] and \[ \frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t_{n-1} \] a Student’s t-distribution with $n-1$ degrees of freedom.
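As an illustration (a minimal sketch assuming NumPy and SciPy; the simulated data and the null value $\mu_0 = 5$ are arbitrary), computing the studentized statistic and comparing it to $t_{n-1}$ is exactly what a one-sample t-test does:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(loc=5.2, scale=2.0, size=30)    # one sample of size n = 30

    mu0 = 5.0                                      # null hypothesis: mu = mu0
    n = x.size
    t = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
    p = 2 * stats.t.sf(abs(t), df=n - 1)           # two-sided p-value under t_{n-1}

    print(t, p)
    print(stats.ttest_1samp(x, popmean=mu0))       # same statistic and p-value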

Delta method and convergence of random variables

Delta method

The Delta method can be used to approximate the distribution (and in particular the variance) of a function of a random variable that converges to a normal random variable. Assuming \(\sqrt{n} (Y_n - \theta) \rightarrow \mathcal{N}(0,\sigma^2)\) (distribution) with $\theta \in \mathbb{R}$, then for a function $g$ differentiable at $\theta$ (with $g'(\theta) \neq 0$), we have \[ \sqrt{n} (g(Y_n) - g(\theta)) \rightarrow \mathcal{N}(0, \sigma^2 [g'(\theta)]^2) \text{ (distribution)} \]
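A quick simulation (a sketch assuming NumPy; the exponential population and $g(x) = x^2$ are arbitrary choices) illustrates the result. With $Y_n$ the mean of $n$ Exp(1) variables, $\theta = \sigma^2 = 1$ and $g'(\theta) = 2$, so the standard deviation of $\sqrt{n}\,(g(Y_n) - g(\theta))$ should be close to $\sigma |g'(\theta)| = 2$:

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 1000, 10_000

    # Y_n = mean of n Exp(1) variables, so theta = 1 and sigma^2 = 1
    ybar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

    # g(x) = x^2, g'(theta) = 2: the Delta method predicts a N(0, 4) limit
    z = np.sqrt(n) * (ybar**2 - 1.0)
    print(z.std())                        # close to 2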

Convergence in distribution and convergence in probability

In the proof of the Delta method, we use the result that \[ \sqrt{n} (Y_n - \theta) \rightarrow \mathcal{N}(0,\sigma^2) \text{ (distribution)} \Rightarrow Y_n \rightarrow \theta \text{ (probability),} \] i.e., for any $\varepsilon > 0$, $\mathbb{P}[|Y_n-\theta| \geq \varepsilon] \rightarrow 0$ as $n$ grows. Now, let us introduce $X_n = \sqrt{n}(Y_n - \theta)$. Then \[ Y_n - \theta = \frac{1}{\sqrt{n}} X_n \] $X_n$ converges to $\mathcal{N}(0,\sigma^2)$ in distribution, and the deterministic sequence $1/\sqrt{n}$ converges to zero, hence also converges to zero in probability. Slutsky’s theorem tells us that \[ U_n \rightarrow U \text{ (distr) and } V_n \rightarrow a \in \mathbb{R} \text{ (prob)} \Rightarrow U_n V_n \rightarrow aU \text{ (distr)} \] Applying it with $U_n = X_n$ and $V_n = 1/\sqrt{n}$ (so $a = 0$), we get $Y_n - \theta \rightarrow 0$ (distr).

Moreover, \[ X_n \rightarrow a \in \mathbb{R} \text{ (distr)} \Rightarrow X_n \rightarrow a \text{ (prob)} \] Indeed, $\mathbb{P}[|X_n-a| > \varepsilon] \leq 1-F_n(a+\varepsilon) + F_n(a-\varepsilon)$ (which is enough, $\varepsilon > 0$ being arbitrary). Convergence in distribution to a constant means that $F_n(x)$ converges to the Heaviside function $H(x-a)$ at every continuity point, i.e., at every $x \neq a$. Since $a+\varepsilon$ and $a-\varepsilon$ are such points, $F_n(a+\varepsilon) \rightarrow 1$ and $F_n(a-\varepsilon) \rightarrow 0$, and the bound goes to zero.

Going back to our original proof: $Y_n - \theta \rightarrow 0$ (distr) with a constant limit, so $Y_n - \theta \rightarrow 0$ (prob), or equivalently $Y_n \rightarrow \theta$ (prob).
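A small simulation (a sketch assuming NumPy, again with an Exp(1) population so that $\theta = \sigma^2 = 1$) shows $\mathbb{P}[|Y_n - \theta| \geq \varepsilon]$ shrinking as $n$ grows:

    import numpy as np

    rng = np.random.default_rng(0)
    theta, eps, reps = 1.0, 0.05, 100_000

    for n in (10, 100, 1_000, 10_000):
        # Y_n = mean of n Exp(theta) variables; their sum is Gamma(n, theta),
        # so the mean can be drawn directly as Gamma(n, theta / n).
        ybar = rng.gamma(shape=n, scale=theta / n, size=reps)
        print(n, np.mean(np.abs(ybar - theta) >= eps))   # estimates P[|Y_n - theta| >= eps]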

Correlation, covariance, and standard deviation

I want to illustrate how correlation and covariance relate to the standard deviations of the random variables they involve.

Let’s introduce two random variables $X_i$, $i=1,2$, and the centered random variables $Y_i = X_i - \mathbb{E}[X_i]$. Then we have \(\mathbb{E}[Y_i] = 0\), \(Var[Y_i] = Var[X_i]\), and \(Cov(Y_1, Y_2) = Cov(X_1,X_2)\). We then introduce two new random variables, $\tilde{X}_i$, with the same expectation as $X_i$, but with a re-scaled standard deviation,

\[\begin{align} \tilde{X}_i & = \mathbb{E}[X_i] + a_i Y_i \notag \\ & = \mathbb{E}[X_i] + a_i (X_i - \mathbb{E}[X_i]) \end{align}\]

with $a_i > 0$ (so that $a_i$ rescales the standard deviation without flipping signs). We can immediately verify that $\mathbb{E}[\tilde{X}_i] = \mathbb{E}[X_i]$ and $Var[\tilde{X}_i] = a_i^2 Var[X_i]$.

Then, the covariance of $\tilde{X}_1$ and $\tilde{X}_2$ is

\[\begin{align} Cov(\tilde{X}_1,\tilde{X}_2) & = \mathbb{E}[(\tilde{X}_1-\mathbb{E}[\tilde{X}_1])(\tilde{X}_2-\mathbb{E}[\tilde{X}_2])] \notag \\ & = a_1a_2 \mathbb{E}[Y_1 Y_2] \notag \\ & = a_1a_2 Cov(X_1,X_2) \end{align}\]

Whereas the correlation between $\tilde{X}_1$ and $\tilde{X}_2$ is given by

\[\begin{align} corr(\tilde{X}_1,\tilde{X}_2) & = \frac{Cov(\tilde{X}_1,\tilde{X}_2)}{\sqrt{Var[\tilde{X}_1] Var[\tilde{X}_2]}} \notag \\ & = \frac{Cov({X}_1,{X}_2)}{\sqrt{Var[{X}_1] Var[{X}_2]}} \notag \\ & = corr({X}_1,{X}_2) \end{align}\]

In words, the covariance scales with the standard deviation of each random variable, whereas the correlation is oblivious to the standard deviations.
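A quick numerical check (a sketch assuming NumPy; the distributions and the values of $a_1$, $a_2$ are arbitrary): rescaling the centered variables multiplies the covariance by $a_1 a_2$ but leaves the correlation unchanged.

    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=100_000)
    x2 = 0.5 * x1 + rng.normal(size=100_000)     # correlated with x1

    a1, a2 = 3.0, 0.2
    xt1 = x1.mean() + a1 * (x1 - x1.mean())      # same mean, rescaled std
    xt2 = x2.mean() + a2 * (x2 - x2.mean())

    print(np.cov(x1, x2)[0, 1], np.cov(xt1, xt2)[0, 1])            # scales by a1 * a2
    print(np.corrcoef(x1, x2)[0, 1], np.corrcoef(xt1, xt2)[0, 1])  # unchanged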

Playing with the Netflix dataset

I downloaded the Netflix dataset from Kaggle. The Netflix dataset is a list of movie ratings entered by different customers. It was provided by Netflix for the Netflix Prize competition. The goal of the competition was to predict missing movie ratings.

So far, I have been playing with the data to get them into an interesting format. My approach is the following (a rough pandas sketch follows the list):

  1. create a column for movieID by copying custID and removing all entries that do not end with ‘:’, then forward-fill (‘ffill’) movieID
  2. create sparse matrix with movie ratings for each customer
  3. create SparseDataFrame from sparse matrix
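Roughly, step 1 (and the pivot toward step 2) might look like this in pandas. This is a sketch under assumptions: the raw file mixes ‘MovieID:’ header lines with ‘custID,rating,date’ lines, the filename and column names are placeholders (not necessarily what I used), and on the full dataset the dense pivot would have to be replaced by an actual sparse matrix as in steps 2 and 3.

    import pandas as pd

    # Placeholder filename; the raw file mixes "MovieID:" header lines
    # with "custID,rating,date" lines, so everything lands in one frame.
    df = pd.read_csv("ratings.txt", header=None, names=["custID", "rating", "date"])

    # Step 1: rows whose custID ends with ':' are movie headers; copy them to a
    # movieID column, strip the colon, and forward-fill down to the rating rows.
    is_header = df["custID"].str.endswith(":")
    df["movieID"] = df["custID"].where(is_header).str.rstrip(":")
    df["movieID"] = df["movieID"].ffill()
    df = df[~is_header]                   # keep only the actual rating rows

    # Toward step 2: a (customer x movie) ratings matrix; done densely like
    # this, it is only viable on a small subset of the data.
    ratings = df.pivot_table(index="custID", columns="movieID", values="rating")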

An interesting question is how to reconcile ratings from different customers. One approach is to normalize the ratings for each customer, by subtracting the mean rating and dividing by the standard deviation of the ratings for that customer. Also, as I don’t want to have to deal with pathological cases, I am going to remove all customers with a single rating.
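Continuing from the long-format frame built above, the normalization might look like this (again a sketch, with the same placeholder column names):

    # df is the long-format frame (custID, rating, movieID) from the previous sketch.
    # Drop customers with a single rating (their rating std is undefined).
    counts = df.groupby("custID")["rating"].transform("count")
    df = df[counts > 1].copy()

    # Normalize per customer: subtract the customer's mean rating and divide
    # by the standard deviation of that customer's ratings.
    per_cust = df.groupby("custID")["rating"]
    df["rating_norm"] = (df["rating"] - per_cust.transform("mean")) / per_cust.transform("std")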

Unfortunately, the SparseDataFrame object appears to be extremely slow for any sort of operation (subtract, mean,…). I ended up doing all the operations before converting to the sparse dataframe.

Terminology and notations in supervised learning

General

The goal of supervised learning is to use a set of measures, called inputs, to predict some outputs. Inputs and outputs also go by other names:

  • inputs: predictors, independent variables, features
  • outputs: responses, dependent variables

Categories of outputs

Outputs can be of two types,

  • quantitative: outputs are numbers, with an ordering. Prediction with quantitative outputs is called regression.
  • qualitative/categorical/discrete: outputs take values in a discrete set of classes, not necessarily ordered. Prediction with qualitative outputs is called classification.

Notations

  • Input variables are denoted by $X=\{X_j\}_j$ (vector).
  • Outputs are denoted by $Y$ (quantitative) or $G$ (qualitative).
  • Predicted or estimated quantities are marked with a hat.
  • $X,Y,G$ are generic variables. The actual observed values are denoted in lowercase; the $i^\text{th}$ observed input value is $x_i$ (potentially a vector).