Cookbook for Pandas

I end up looking for how to do things in pandas way too often. So I’ll summarize some basic features that I need.

  • datetime
    • converting to datetime format
      df['time'] = pd.to_datetime(df['time'])
      

      It can also be used on the index

      df.index = pd.to_datetime(df.index)
      
    • Shift a time index by a constant offset
      import datetime
      df.index = df.index - datetime.timedelta(hours=6)
      
    • Change the sampling frequency; for instance, going from 1-minute data to 20-minute data
      df.set_index('time', inplace=True)
      df20 = df['open'].resample('20min').first().to_frame('open')
      df20['close'] = df['close'].resample('20min').last()
      df20['mean'] = 0.5*(df['open'].resample('20min').mean() + df['close'].resample('20min').mean())
      df20['low'] = df['low'].resample('20min').min()
      df20['high'] = df['high'].resample('20min').max()
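
      Equivalently, the column-wise aggregations can be done in one call with agg (a sketch, assuming the same ‘open’, ‘close’, ‘low’, and ‘high’ columns):

      df20 = df.resample('20min').agg({'open': 'first',
                                       'close': 'last',
                                       'low': 'min',
                                       'high': 'max'})
      df20['mean'] = 0.5*(df['open'].resample('20min').mean() + df['close'].resample('20min').mean())

      agg applies a different aggregation to each column in a single pass over the resampled groups.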
      

      Many different frequencies can be used with resample, the main ones being

      M   monthly
      W   weekly
      D   daily
      H   hourly
      

      A nice reference is available here.

  • Normalize a column (e.g., ‘volume’) by its daily maximum; assuming the index has minute resolution,
    df['date'] = df.index.date
    df.volume / df.groupby('date').volume.transform('max')
    
  • Plot the results (‘res’) of a groupby with a time index, all on top of each other. The trick is to turn off the use of the index when plotting,
    df.groupby('date').res.plot(use_index=False)
    plt.show()
    
  • Take the negation of a boolean index selection with ~. For instance, the following sets all entries in the ‘movie’ column to NaN where they do not end with a colon ‘:’,
    df.loc[~df['movie'].str.endswith(':'), 'movie'] = np.nan
    

Tags page

I had some issues formatting a page containing all the tags used on the blog. All the coding had already been done in a blog post that I followed. However, instead of having all tags written next to each other on the same line, they were stacked on top of each other, which takes a lot more screen space. I didn’t really get to the bottom of it, but it turns out this can be fixed by defining the tag page as an HTML file instead of a Markdown (md) file.

Gaussian measures in high dimension

High dimensional spaces can be counter-intuitive. For instance, most of the volume of a ball in high dimension is concentrated in a thin shell near its surface. To see this, remember that the volume of an n-dimensional ball of radius $r$ is given by \[ V_n(r) = \frac{\pi^{n/2}}{\Gamma(n/2+1)} r^n. \] With this formula, the fraction of the volume of the unit ball found in-between the spheres of radii 0.99 and 1.00 (i.e., a shell of thickness 0.01) is $1 - 0.99^n$, which comes to about $10\%$ in $\mathcal{R}^{10}$, $39\%$ in $\mathcal{R}^{50}$, and $63\%$ in $\mathcal{R}^{100}$.
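
A quick numerical check of these percentages (a minimal sketch; since $V_n(r) \propto r^n$, the shell fraction is simply $1 - 0.99^n$):

    # fraction of the unit ball's volume in the shell between radii 0.99 and 1.00
    for n in (10, 50, 100):
        print(f"n = {n:3d}: {100 * (1 - 0.99**n):.1f}% of the volume is in the outer shell")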

This example also helps explain why samples from a multivariate Gaussian in $\mathcal{R}^n$ tend to accumulate on a sphere of radius $\sqrt{n}$ when $n$ grows large. The pdf of a Gaussian decreases exponentially fast as the distance from the mean grows, but the volume available at a given radius grows like $r^{n-1}$, pushing mass outward. The result is that the norm of a multivariate standard normal sample has its mode at a distance $\sqrt{n-1}$ from the origin.

For a multivariate standard normal ($\mu=0$, $\Sigma=I$) in $\mathcal{R}^n$, the $l_2$-norm of the samples has a chi distribution with parameter $n$, $\chi_n$; it has mode $\sqrt{n-1}$, and a mean that is approximately equal to $\sqrt{n}$ (more precisely, close to $\sqrt{n-1/2}$) as $n$ grows. The variance, given by $n$ minus the square of the mean, therefore stays bounded, tending to $1/2$; it does not grow with $n$. Empirically, with 1,000 samples, I found the standard deviation of the norms to remain around 0.7 ($\approx \sqrt{1/2}$) for dimensions $n=1$ to 100,000. A more rigorous explanation that almost all samples remain in a band of constant thickness can be found here.
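
A minimal sketch of that experiment (assuming NumPy; 1,000 samples per dimension, as above, with the largest dimensions left out to keep memory modest):

    import numpy as np

    rng = np.random.default_rng(0)
    for n in (1, 10, 100, 1_000, 10_000):
        # 1,000 standard normal samples in R^n and their l2 norms
        norms = np.linalg.norm(rng.standard_normal((1_000, n)), axis=1)
        print(f"n = {n:6d}: mean of norms ~ {norms.mean():8.2f} "
              f"(sqrt(n) = {np.sqrt(n):8.2f}), std of norms ~ {norms.std():.2f}")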

Side note

As a ‘fun’ little exercise, we can re-derive the distribution of the $l_2$-norm of a multivariate standard normal. Assume $X \sim \mathcal{N}(0,I)$; then \[ \mathcal{P}(|X| \leq \alpha) = \int_{|x| \leq \alpha} (2\pi)^{-n/2} \exp(-|x|^2/2) \, dx. \] We turn to spherical coordinates, \[ x_1 = r \cos \theta_1, \] \[ x_2 = r \sin \theta_1 \cos \theta_2, \] \[ \vdots \] \[ x_i = r \sin \theta_1 \ldots \sin \theta_{i-1} \cos \theta_i, \] \[ \vdots \] \[ x_n = r \sin \theta_1 \ldots \sin \theta_{n-2} \sin \theta_{n-1}, \] and the Jacobian of the transformation is given by \[ \left| \frac{\partial x}{\partial (r,\theta)} \right| = r^{n-1} \sin^{n-2} \theta_1 \ldots \sin \theta_{n-2}. \] However, the exact form of the Jacobian does not matter. Since the pdf of the multivariate standard normal only depends on the radius, $\mathcal{P}(|X|\leq \alpha)$ is given by \[ \mathcal{P}(|X|\leq \alpha) = S_n (2\pi)^{-n/2} \int_0^{\alpha} r^{n-1} e^{-r^2/2} \, dr , \] where $S_n$ collects the part of the volume of the unit n-ball that does not depend on the radius, i.e., all the integrals over the $\theta_i$’s. Again, we use the formula for the volume of the unit n-ball to get \[ \frac{\pi^{n/2}}{\Gamma(n/2+1)} = S_n \int_0^1 r^{n-1} dr = \frac{S_n}{n}. \] Using $\Gamma(n/2+1) = (n/2)\,\Gamma(n/2)$, we re-write $S_n$ as \[ S_n = \frac{n \pi^{n/2}}{(n/2) \, \Gamma(n/2)} = \frac{2 \pi^{n/2}}{\Gamma(n/2)}, \] and finally get \[ \mathcal{P}(|X|\leq \alpha) = \frac{1}{2^{n/2-1} \Gamma(n/2)} \int_0^{\alpha} r^{n-1} e^{-r^2/2} \, dr , \] which is the cdf of a chi distribution with $n$ degrees of freedom.
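
As a sanity check (a sketch assuming SciPy is available, with an arbitrary dimension and radius), we can integrate the derived density numerically and compare it with the chi cdf from scipy.stats:

    import numpy as np
    from scipy import integrate, special, stats

    n, alpha = 5, 2.0  # dimension and radius, chosen arbitrarily for the check

    def density(r):
        # integrand derived above: r^{n-1} e^{-r^2/2} / (2^{n/2-1} Gamma(n/2))
        return r**(n - 1) * np.exp(-r**2 / 2) / (2**(n/2 - 1) * special.gamma(n/2))

    derived_cdf, _ = integrate.quad(density, 0, alpha)
    print(derived_cdf)              # ~0.45
    print(stats.chi.cdf(alpha, n))  # same value, from the chi distribution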

Hypothesis testing with linear regression

AIC in linear regression

Model selection is a central problem of statistical inference, whether you have to choose which independent variables to include in a linear regression, or compare completely unrelated techniques. In the context of linear regression, using $R^2$ to compare models would lead to over-fitting, and even the adjusted $R^2$ does not necessarily penalize larger models sufficiently. An army of other methods exists (see here), among which three really stand out:

  • cross-validation: it comes in many different flavours, but at its core the idea is to leave out some of your training data and later test your model on it. You can repeat the procedure with different held-out subsets, or not. One obvious disadvantage of cross-validation is its potential computational cost, especially if you want to exhaust all possible splits of the data. However, in the case of linear regression, the leave-one-out cross-validation error can be computed directly from a single fit, using the hat matrix (see the sketch after this list). Also, cross-validation is more restricted when applied to time-series data, since observations cannot be shuffled freely.
  • Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC): these two criteria explicitly penalize models with a large number of parameters.
    • AIC: It is generally given as \[ AIC(\beta) = 2k - 2\ln(\mathcal{L}) \] where $k$ is the number of parameters and $\mathcal{L}$ is the likelihood at the MLE.
    • BIC: the BIC imposes a stronger penalty on the number of parameters of the model (for $n \geq 8$, since then $\ln(n) > 2$), \[ BIC(\beta) = \ln(n)\, k - 2\ln(\mathcal{L}) \] These two criteria are extremely popular; one disadvantage is their reliance on the likelihood function, which requires a specific distributional assumption.
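
To make the remark about cross-validation in linear regression concrete, here is a minimal sketch (assuming NumPy, a design matrix X that includes an intercept column, and hypothetical simulated data) of the leave-one-out cross-validation error obtained from a single least-squares fit via the hat matrix:

    import numpy as np

    def loocv_mse(X, y):
        """Leave-one-out CV mean squared error for OLS, from a single fit.

        Uses the identity e_loo_i = e_i / (1 - h_ii), where h_ii are the
        diagonal entries of the hat matrix H = X (X'X)^{-1} X'.
        """
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ beta_hat
        h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))
        return np.mean((residuals / (1 - h)) ** 2)

    # hypothetical data: 200 observations, an intercept and 3 regressors
    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(200), rng.standard_normal((200, 3))])
    y = X @ np.array([1.0, 2.0, 0.5, -1.0]) + rng.standard_normal(200)
    print(loocv_mse(X, y))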

Just as an exercise, let us derive the AIC in the case of linear regression. Assuming Gaussian noise $\varepsilon \sim \mathcal{N}(0, \Sigma)$, the model is $ Y = X \beta + \varepsilon $. The log-likelihood function is then given by \[ \ln(\mathcal{L}) = -\frac12 \ln( \det(2\pi \Sigma) ) - \frac12 (Y - X \beta)^T \Sigma^{-1} (Y - X \beta). \] In the case of iid normal noise, i.e., $\Sigma = \sigma^2 I$, this becomes \[ \ln(\mathcal{L}) = -\frac{n}2 \ln( 2\pi \sigma^2 ) - \frac1{2\sigma^2} e^T e, \] where $e = Y - X \beta$ and $n$ is the number of observations. Let us denote by $\hat{\beta}$ the MLE (and $\hat{e}=Y-X \hat{\beta}$). As we typically do not know the value of $\sigma^2$, we replace it with its maximum-likelihood estimate $\hat{\sigma}^2 = \frac{\hat{e}^T\hat{e}}{n}$. This gives \[ AIC = 2k + n + n \ln(2\pi) + n \ln \left( \frac{\hat{e}^T\hat{e}}{n} \right) = n \left[ 1 + \ln(2\pi) + \frac{2k}n + \ln \left( \frac{\hat{e}^T\hat{e}}{n} \right) \right]. \] As a comparison, Greene (2005) uses $2k/n + \ln(\hat{e}^T\hat{e}/n)$, which is, after dividing by $n$ and dropping the constants that do not vary between models, the same. An additional reference for model selection in regression is here.
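
A small numerical sketch of the derived formula (assuming NumPy; the simulated data are hypothetical, and $k$ counts the regression coefficients, as in the derivation above):

    import numpy as np

    def aic_bic(X, y):
        """AIC and BIC for an OLS fit, using the Gaussian log-likelihood above."""
        n, k = X.shape
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        e_hat = y - X @ beta_hat
        sigma2_hat = e_hat @ e_hat / n                     # MLE of the noise variance
        log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)
        return 2 * k - 2 * log_lik, np.log(n) * k - 2 * log_lik

    # hypothetical example: the last regressor has a zero coefficient,
    # so the smaller model should typically score better (lower is better)
    rng = np.random.default_rng(1)
    X_full = np.column_stack([np.ones(500), rng.standard_normal((500, 3))])
    y = X_full[:, :3] @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(500)
    print("full  model:", aic_bic(X_full, y))
    print("small model:", aic_bic(X_full[:, :3], y))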

In a future post, I’d like to talk about model averaging.