23 Jan 2018
I end up looking up how to do things in pandas way too often. So I’ll summarize
some basic features that I need.
- datetime
- converting to datetime format
df['time'] = pd.to_datetime(df['time'])
It can also be used on the index
df.index = pd.to_datetime(df.index)
- Shift a time index by a constant offset
import datetime
df.index = df.index - datetime.timedelta(hours=6)
- Change sampling frequency; for instance, going from 1min data to 20min data
df.set_index('time', inplace=True)
df20 = df['open'].resample('20min').first().to_frame('open')  # build a DataFrame so more columns can be added
df20['close'] = df['close'].resample('20min').last()
df20['mean'] = 0.5*(df['open'].resample('20min').mean() + df['close'].resample('20min').mean())
df20['low'] = df['low'].resample('20min').min()
df20['high'] = df['high'].resample('20min').max()
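As a more compact alternative (just a sketch, assuming the same column names as above), the aggregations can be passed to a single resample call:

```python
# one resample call for the whole DataFrame instead of one per column
df20 = df.resample('20min').agg({'open': 'first',
                                 'high': 'max',
                                 'low': 'min',
                                 'close': 'last'})
df20['mean'] = 0.5*(df['open'].resample('20min').mean()
                    + df['close'].resample('20min').mean())
```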
Many different frequencies can be used with resample; the main ones are
- M: monthly
- W: weekly
- D: daily
- H: hourly
A nice reference is available here.
- Normalize data (e.g., ‘volume’) by its daily maximum; assuming the index has minute resolution,
df['date'] = df.index.date
df.volume / df.groupby('date').volume.transform('max')
- Plot the results (‘res’) of a groupby with a time index, all on top of each other. The trick here is to turn off the use of the index when plotting,
import matplotlib.pyplot as plt
df.groupby('date').res.plot(use_index=False)
plt.show()
- Take the negation of a boolean index selection with ~. For instance, the following sets all entries in the ‘movie’ column to NaN where they do not end with a colon ‘:’,
df.loc[~df['movie'].str.endswith(':'), 'movie'] = np.nan
21 Jan 2018
I had some issues formatting a page containing all the tags used in the blog. All
the coding had already been done in a blog post
that I followed. However, instead of having all the tags written next to each other
on the same line, they were placed on top of each other, which took up a lot
more screen space. I didn’t really get to the bottom of it, but it turns out
this can be fixed by defining the tag page as an HTML file instead of a
markdown (md) file.
20 Jan 2018
High-dimensional spaces can be counter-intuitive. For instance,
most of the mass
of a sphere in high dimension is concentrated in a thin shell near its surface.
To see this, remember that the volume of an
n-sphere of radius $r$ is
given by
\[ V_n(r) = \frac{\pi^{n/2}}{\Gamma(n/2+1)} r^n. \]
With this formula, we find that the fraction of the mass of a unit n-sphere found
between two spheres of radii 0.99 and 1.00 (i.e., in a shell of thickness 0.01)
is $1 - 0.99^n$, which equals $10\%$ in $\mathcal{R}^{10}$, $39\%$ in $\mathcal{R}^{50}$, and $63\%$
in $\mathcal{R}^{100}$.
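A quick numerical check of these percentages (a minimal sketch):

```python
# fraction of the volume of the unit n-sphere lying in the shell [0.99, 1.00]
for n in (10, 50, 100):
    print(n, round(100 * (1 - 0.99**n)))   # -> 10, 39, 63 (percent)
```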
This example also helps explain why samples from a multivariate Gaussian
in $\mathcal{R}^n$ tend to accumulate on a sphere of radius $\sqrt{n}$ when $n$
grows large.
The pdf of a Gaussian decreases exponentially fast as the distance from the mean
grows, but the volume available at radius $r$ grows like $r^{n-1}$, and the two
effects balance near $\sqrt{n}$.
For a multivariate standard normal ($\mu=0$, $\Sigma=I$) in $\mathcal{R}^n$, the $l_2$-norm of the
samples follows a chi distribution
with parameter $n$, $\chi_n$; its mode is $\sqrt{n-1}$ and its mean is
approximately equal to $\sqrt{n}$ as $n$ grows. The variance of the norm, given
by $n$ minus the square of the mean, therefore remains bounded (it tends to $1/2$),
so the spread of the norms does not grow with the dimension.
Empirically, with 1,000 samples, I found the standard deviation of the norms
to remain around 0.7 for dimensions $n=1$ to 100,000. A more rigorous
explanation of why almost all samples remain in a band of constant thickness
can be found
here.
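A minimal sketch reproducing this kind of experiment (with smaller dimensions than in the text, just to keep the memory footprint low):

```python
import numpy as np

for n in (10, 100, 1_000, 10_000):
    # 1,000 samples from a standard normal in R^n
    samples = np.random.randn(1_000, n)
    norms = np.linalg.norm(samples, axis=1)
    # the mean is close to sqrt(n); the standard deviation stays around 1/sqrt(2) ~ 0.71
    print(n, round(norms.mean(), 2), round(norms.std(), 2))
```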
Side note
As a ‘fun’ little exercise, we can re-derive the distribution of the $l_2$-norm
of a multivariate standard normal. Assume $X \sim \mathcal{N}(0,I)$; then
\[ \mathcal{P}(|X| \leq \alpha) = \int_{|x| \leq \alpha} (2\pi)^{-n/2}
\exp(-|x|^2/2) \, dx. \]
We turn to spherical coordinates,
\[ x_1 = r \cos \theta_1 \]
\[ x_2 = r \sin \theta_1 \cos \theta_2 \]
\[ \vdots \]
\[ x_i = r \sin \theta_1 \ldots \sin \theta_{i-1} \cos \theta_i \]
\[ \vdots \]
\[ x_n = r \sin \theta_1 \ldots \sin \theta_{n-2} \sin \theta_{n-1} \]
And the Jacobian of the transformation is given
by
\[ \left| \frac{\partial x}{\partial \theta} \right| = r^{n-1} \sin^{n-2}
\theta_1 \ldots \sin \theta_{n-2} \]
However, the exact form of the Jacobian does not matter. Since the pdf of the
multivariate standard normal only depends on the radius, $\mathcal{P}(|X|\leq
\alpha)$ will be given by
\[ \mathcal{P}(|X|\leq \alpha) = S_n (2\pi)^{-n/2} \int_0^{\alpha} r^{n-1}
e^{-r^2/2} dr , \]
where $S_n$ is the part of the volume of the unit n-sphere that does not depend on
the radius, i.e., all the integrals over the $\theta_i$’s. Again, we use the
formula for the volume of a unit n-sphere to get
\[ \frac{\pi^{n/2}}{\Gamma(n/2+1)} = S_n \int_0^1 r^{n-1} dr \]
Using the definition of the
gamma function
($\Gamma(n/2+1) = (n/2)\,\Gamma(n/2)$) and $\int_0^1 r^{n-1}\,dr = 1/n$, we re-write $S_n$ as
\[ S_n = \frac{n \, \pi^{n/2}}{(n/2) \, \Gamma(n/2)} = \frac{2 \pi^{n/2}}{\Gamma(n/2)} \]
and finally get
\[ \mathcal{P}(|X|\leq \alpha) =
\frac{1}{2^{n/2-1} \Gamma(n/2)} \int_0^{\alpha} r^{n-1}
e^{-r^2/2} dr , \]
which is the cdf of a chi distribution with $n$ degrees of freedom.
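As a sanity check of this derivation, the simulated norms can be compared against scipy’s implementation of the chi distribution (a minimal sketch; the sample size and dimension are arbitrary choices):

```python
import numpy as np
from scipy import stats

n = 50
norms = np.linalg.norm(np.random.randn(100_000, n), axis=1)

# Kolmogorov-Smirnov test against chi with n degrees of freedom:
# a large p-value means the norms are consistent with a chi_n distribution
print(stats.kstest(norms, stats.chi(df=n).cdf))
print(stats.chi(df=n).mean(), np.sqrt(n))  # the mean is indeed close to sqrt(n)
```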
12 Jan 2018
Model selection is a central problem of statistical inference, whether you have to choose which independent variables to include in your linear regression or compare completely unrelated techniques.
In the context of linear regression, using $R^2$ to compare models would lead to over-fitting, and even the adjusted $R^2$ does not necessarily penalize larger models sufficiently.
An army of other methods exists (see here), among which three really stand out:
- cross-validation: it comes in many different flavours, but at its core the idea is to leave out some of your training data and later test your model on it. You can repeat the procedure with different held-out subsets, or not.
One obvious disadvantage of cross-validation is its potential computational cost, especially if you want to exhaust all possible splits of the data.
However, in the case of linear regression, the leave-one-out cross-validation error can be
computed directly after solving the regression (see the sketch after this list).
Also, cross-validation is trickier to apply to time-series data, where the splits must respect the temporal ordering.
- Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC): these two criteria explicitly penalize models with a large number of parameters.
- AIC: It is generally given as
\[ AIC(\beta) = 2k - 2\ln(\mathcal{L}) \]
where $k$ is the number of parameters and $\mathcal{L}$ is the likelihood at the MLE.
- BIC: the BIC criterion imposes a stronger penalty on the number of parameters of the model,
\[ BIC(\beta) = \ln(n) k - 2\ln(\mathcal{L}) \]
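The shortcut for linear regression mentioned in the cross-validation bullet above is the classical leave-one-out identity: the held-out residual for observation $i$ equals $\hat{e}_i/(1-h_{ii})$, where $h_{ii}$ is the $i$-th diagonal entry of the hat matrix $H = X(X^TX)^{-1}X^T$. A minimal sketch on made-up data (all names and values are purely illustrative):

```python
import numpy as np

np.random.seed(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), np.random.randn(n, k - 1)])  # design matrix with intercept
y = X @ np.array([1.0, 2.0, -0.5]) + np.random.randn(n)       # made-up response

# a single ordinary least-squares fit
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat

# exact leave-one-out residuals, without refitting n times
H = X @ np.linalg.inv(X.T @ X) @ X.T
loo_residuals = e / (1 - np.diag(H))
print(np.mean(loo_residuals**2))   # leave-one-out cross-validation MSE
```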
AIC and BIC are extremely popular; one disadvantage is their reliance on the likelihood function, which requires a specific distributional assumption.
Just as an exercise, let us derive the AIC in the case of linear regression. Assuming Gaussian noise $\varepsilon \sim \mathcal{N}(0, \Sigma)$, we have $ Y = X \cdotp \beta + \varepsilon $.
The log-likelihood function is then given by
\[ \ln(\mathcal{L}) = -\frac12 \left( \ln( \det(2\pi \Sigma) ) \right) - \frac12 (Y - X \cdotp \beta)^T \Sigma^{-1} (Y - X \cdotp \beta) \]
In the case of iid normal noise, i.e., $\Sigma = \sigma^2 I$, we have
\[ \ln(\mathcal{L}) = -\frac12 \left( n \ln( 2\pi \sigma^2 ) \right) - \frac1{2\sigma^2} e^T e \]
where $e = Y - X \cdotp \beta$ and $n$ is the number of observations.
Let us denote by $\hat{\beta}$ the MLE (and $\hat{e}=Y-X \cdotp \hat{\beta}$).
As we typically do not know the value of $\sigma^2$, we replace it with its maximum likelihood estimate $\hat{\sigma}^2 = \frac{\hat{e}^T\hat{e}}{n}$. This gives
\[ AIC = 2k + n + n \ln(2\pi) + n \ln \left( \frac{\hat{e}^T\hat{e}}{n} \right) = n \left[ 1 + \ln(2\pi) + \frac{2k}n + \ln \left( \frac{\hat{e}^T\hat{e}}{n} \right) \right]. \]
As a comparison, Greene (2005) uses $2k/n + \ln(\hat{e}^T\hat{e}/n)$, which is the same up to an overall factor of $n$ and constant terms that do not vary between models.
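To make the formula concrete, here is a minimal sketch of the AIC computation for an OLS fit, reusing the made-up X and y from the cross-validation sketch above (here $k$ only counts the regression coefficients; adding one for $\sigma^2$ shifts every model’s AIC by the same constant):

```python
import numpy as np

def aic_ols(X, y):
    """AIC of an ordinary least-squares fit, using the formula derived above."""
    n, k = X.shape                      # k = number of regression coefficients
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_hat                # residuals e_hat
    return 2 * k + n + n * np.log(2 * np.pi) + n * np.log(e @ e / n)

# the lower the AIC, the better: compare the full model to one without the last regressor
print(aic_ols(X, y), aic_ols(X[:, :2], y))
```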
An additional reference for model selection in regression is here.
In a future post, I’d like to talk about model averaging.