Webdev 101

Start a local web server

  • (optional) Start a Docker container and expose port 8000: `docker run -it -p 8000:8000 --entrypoint bash -w /home -v $PWD:/home default:1.0`
  • Start a web server using Python (port 8000 by default): `python -m http.server` (a programmatic equivalent is sketched below)
  • To access that web site from another machine on my network, use the URL http://<local_ip_address>:8000. You can find your local IP address with `ipconfig getifaddr en0` when connected to Wi-Fi. Note that, for some reason, you can’t use localhost on Chrome and instead need to use 127.0.0.1, which is the same in the end. The answer to that StackOverflow question provides a fix.
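
For reference, here is a minimal Python sketch of the same server using the standard library’s http.server module; binding to 0.0.0.0 is what makes the server reachable from other machines on the network.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Serve the current working directory, like `python -m http.server 8000`.
# Binding to 0.0.0.0 makes the server reachable from other machines on the
# local network, via http://<local_ip_address>:8000.
server = HTTPServer(("0.0.0.0", 8000), SimpleHTTPRequestHandler)
print("Serving on http://0.0.0.0:8000")
server.serve_forever()
```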

Goodhart's law

I recently discovered what is termed Goodhart’s law. It has different forms, but the most relevant for me goes as follows: “When a measure becomes a target, it ceases to be a good measure.” The first application that came to mind was overfitting in machine learning. And I am not alone: this blog post discusses the connection and the potential pitfall of massive grid searches over hyperparameters. In practice, we’ll tend to use a test set after hyperparameter search to estimate the expected out-of-sample model performance, but I think this is still an interesting observation. I also feel this really applies when we select a model based on its performance on a downstream application instead of on its primary task. For instance, in a forecast + optimization application, we might select the forecasting model not based on its forecasting ability, but on the performance of the entire pipeline. I’m convinced that this will lead to severe cases of overfitting.
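
As a concrete illustration of that safeguard, here is a minimal scikit-learn sketch (the library choice and the toy problem are mine, not from the post): hyperparameters are tuned by cross-validation on the training data, and a held-out test set gives the honest performance estimate.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy regression problem, with a test set that plays no role in the search.
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The cross-validation score is the *target* of the search, so it becomes
# optimistically biased once we pick the best configuration (Goodhart's law).
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

print("best CV R^2 :", search.best_score_)            # biased by the search
print("test R^2    :", search.score(X_test, y_test))  # honest estimate
```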

Notes on the Transformer architecture

Simon Prince put together a three-part series of blog posts about the Transformer architecture: I, II, and III. I found the articles very clear and useful, and I want to save a few notes I took about them.

We assume we have $I$ inputs. Inputs could be words, tokens (elementary units that can be combined to form all the words in the dictionary), or something else; it doesn’t really matter. Each input is converted to its embedding, which has dimension $D$. For NLP, if we want to pass a sentence, we can thus stack the embeddings of the tokens in the sentence into a matrix $X \in \mathbb{R}^{I \times D}$.

While attention was first introduced in the work of Bahdanau et al., 2015, it was formalized in the paper Attention is all you need, which introduced the Transformer architecture. The main element is that the attention mechanism is split into three components: keys, queries, and values; vocabulary that comes from the field of information retrieval. Informally, the query is what you are requesting (e.g., the output of the decoder at the current step), the keys are what the queries are going to be compared with, and the values are what is going to be combined to generate the output. The main aspect of the Transformer is the self-attention mechanism, in which the same input is used to form the queries, the keys, and the values. Before being combined, the input $X$ is linearly transformed to a lower-dimensional (column-)space through the matrices $\Phi_v, \Phi_k, \Phi_q \in \mathbb{R}^{D \times \tilde{D}}$, to obtain

\[\begin{aligned} X \cdotp \Phi_v & = V \in \mathbb{R}^{I \times \tilde{D}} \\ X \cdotp \Phi_k & = K \in \mathbb{R}^{I \times \tilde{D}}\\ X \cdotp \Phi_q & = Q \in \mathbb{R}^{I \times \tilde{D}} \end{aligned}\]

Typically, $\tilde{D}$ is chosen to be lower than $D$, which reduces the computational cost of the linear algebra. Once this transformation is done, the weights to combine the values are calculated by checking the similarity between the queries and the keys (via a dot product), $Q \cdotp K^T \in \mathbb{R}^{I \times I}$, then normalizing the weights via a softmax (applied across each row independently), $\text{softmax}(Q \cdotp K^T) \in \mathbb{R}^{I \times I}$. Finally, we can apply these weights (which sum to $1$ for a given query, i.e., a given row) to the values

\[\text{softmax}(Q \cdotp K^T) V \in \mathbb{R}^{I \times \tilde{D}}\]
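
To make the notation concrete, here is a minimal NumPy sketch of single-head self-attention following the formulas above. The shapes and symbol names mirror the text; note that the original paper additionally scales the scores by $1/\sqrt{\tilde{D}}$ before the softmax, which I omit to match the formula above.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax: each row (one query) becomes a distribution over inputs.
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def self_attention(X, Phi_q, Phi_k, Phi_v):
    # X: (I, D) input embeddings; Phi_*: (D, D_tilde) projection matrices.
    Q, K, V = X @ Phi_q, X @ Phi_k, X @ Phi_v  # each (I, D_tilde)
    A = softmax(Q @ K.T)                       # (I, I), rows sum to 1
    return A @ V                               # (I, D_tilde)

# Toy example with I = 4 inputs, D = 8, D_tilde = 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Phi_q, Phi_k, Phi_v = (rng.normal(size=(8, 3)) for _ in range(3))
print(self_attention(X, Phi_q, Phi_k, Phi_v).shape)  # (4, 3)
```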

SSH access via SSH key

When accessing a remote server, you typically have to enter your password every time you connect (ssh, sshfs, …). If you access that server via the SSH protocol, you can instead authenticate with an SSH key.

It’s actually pretty easy to do. All you have to do is add your public SSH key to the file ~/.ssh/authorized_keys on the server side. On your laptop, you need to have your SSH key added to your keychain. Then, every time you ssh to the server, the remote server will send you a challenge that can only be answered with your private SSH key.

That article contains a lot of information about that topic.
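
As an illustration, here is a small Python sketch of key-based authentication in action, assuming the third-party paramiko package and hypothetical host, user, and key-path values; once the public key is in ~/.ssh/authorized_keys, no password prompt is needed.

```python
import os
import paramiko  # third-party SSH library; my choice, not from the article

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # demo only

# Hypothetical host and user; authentication happens via the private key,
# so no password is sent.
client.connect(
    "server.example.com",
    username="me",
    key_filename=os.path.expanduser("~/.ssh/id_ed25519"),
)
_, stdout, _ = client.exec_command("hostname")
print(stdout.read().decode())
client.close()
```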

Synthetic Data Generation

Synthetic data generation is a very popular topic these days, and I believe for good reason. I found this blog post that lists a few places that already offer synthetic data generation. Most of the applications listed are for computer vision or NLP, but there are a few tabular applications that may be suitable for time series.

There was also this blog post from YData looking at the TimeGAN paper from Cambridge.