PyCharm with an SSH interpreter

If you have a Docker container running on a remote server, you can set up PyCharm to use the Python interpreter inside that container as if it were local. To do so, go to Preferences > Project > Project Interpreter and select the SSH Interpreter option. Then set up your connection, making sure to indicate the right port. There is more on this in this article.

Once this was all set up properly, I still had an issue running pytest with this SSH interpreter: pytest would point to my local path, which does not exist on the remote side. I had to define a path mapping between my local path and my remote path. To do so, go to Preferences > Project > Project Interpreter. Below the project interpreter, you will see Path mappings, where you can define a mapping from your local path to your remote one. After that, pytest should be able to find your tests.
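
To make the idea of a path mapping concrete, here is a minimal sketch of the translation it performs. This only illustrates the concept and is not PyCharm's actual implementation; the two directory names and the helper `to_remote` are made up for the example.

```python
from pathlib import PurePosixPath

# Hypothetical mapping: local project root -> project root on the remote side.
PATH_MAPPINGS = {
    "/Users/me/projects/my_project": "/app/my_project",
}


def to_remote(local_path, mappings=PATH_MAPPINGS):
    """Translate a local path into its remote counterpart using prefix mappings."""
    local = PurePosixPath(local_path)
    for local_root, remote_root in mappings.items():
        try:
            relative = local.relative_to(local_root)
        except ValueError:
            continue  # local_path is not under this local root, try the next one
        return str(PurePosixPath(remote_root) / relative)
    return local_path  # no mapping applies, so the path is passed through unchanged


print(to_remote("/Users/me/projects/my_project/tests/test_model.py"))
```

Without a mapping, the test file path is sent through unchanged, which would explain why pytest looks for a local path that the remote interpreter cannot see.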

Installing cmake 3.17.3 inside a Docker image

Installing xgboost requires cmake version 3.17.3 or later, but apt-get was only installing cmake 3.5.x or thereabouts. A nice solution is described in this post. I picked the second approach, installing from the binary. This is the code I included in my Dockerfile:

RUN mkdir /opt/cmake && \
    cd /opt/cmake && \
    wget https://github.com/Kitware/CMake/releases/download/v3.17.3/cmake-3.17.3-Linux-x86_64.sh && \
    bash cmake-3.17.3-Linux-x86_64.sh --skip-license && \
    ln -s /opt/cmake/bin/cmake /usr/local/bin/cmake && \
    cmake --version

The last line is just to check that it’s working. Also, note that you can install different versions of cmake. You can find your favorite one on their website.

Passing environment variables to your Docker build

It can be useful to pass environment variables from your local environment to your Docker build. I ran into this when I had to pass PyPI keys to install specific packages. But since a Docker build is isolated from your shell's environment, you need to take a couple of steps to make it happen. Note that I’m assuming you’re building your Docker image through a Dockerfile.

  1. You need to pass the environment variables to your build command using the flag --build-arg. For instance,
    docker build --build-arg DOCKER_ENV_VAR="$MY_LOCAL_ENV_VAR" -f Dockerfile -t my_image:my_tag .
    
  2. You need to declare these variables in your Dockerfile. Continuing the previous example, this means adding the following lines to your Dockerfile:
    ARG DOCKER_ENV_VAR
    ...
    RUN apt-get install <something> --flag=$DOCKER_ENV_VAR
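
If the build is launched from a script rather than typed by hand, the first step can be sketched in Python. This is a minimal sketch under assumptions: the helper name `docker_build_command` is hypothetical, and it only assembles the argument list (reusing the variable and image names from the example above) without running Docker.

```python
import os
import shlex


def docker_build_command(build_args, image_tag, dockerfile="Dockerfile",
                         context=".", environ=None):
    """Assemble a `docker build` argument list.

    build_args maps the ARG name used in the Dockerfile to the name of the
    local environment variable that holds its value.
    """
    environ = os.environ if environ is None else environ
    command = ["docker", "build"]
    for arg_name, env_name in build_args.items():
        if env_name not in environ:
            # Fail early instead of silently passing an empty value to the build.
            raise KeyError(f"local environment variable {env_name!r} is not set")
        command += ["--build-arg", f"{arg_name}={environ[env_name]}"]
    command += ["-f", dockerfile, "-t", image_tag, context]
    return command


# Example with a fake environment, so nothing secret is needed to try it out.
cmd = docker_build_command(
    {"DOCKER_ENV_VAR": "MY_LOCAL_ENV_VAR"},
    "my_image:my_tag",
    environ={"MY_LOCAL_ENV_VAR": "example-value"},
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. Passing a list rather than a single string avoids shell quoting problems when the value contains spaces or special characters.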
    

pip-tools

There are a few ways to manage dependencies: conda, poetry, pipenv. I recently discovered a different way, pip-tools. It’s actually very easy to use and, in particular, easy to integrate with a Docker image. You simply create a requirements.in file, which pip-compile converts to a requirements.txt file that you can then install inside your image with pip install -r requirements.txt.

There are multiple comparisons of poetry, pipenv, and pip-tools out there, including this one, which compares them specifically in the context of Docker, and that one, which was updated in December 2019 and still declares pip-tools the winner. I also found that blog post useful, as it shows a quick example of how to write a requirements.in file.

You can install pip-tools through pip: pip install pip-tools. The only things you need to be careful with are the Python version and OS you use to convert your requirements.in file to a requirements.txt file. These need to be the same as what you’ll use for your virtual environment. With Docker, this can be controlled by running pip-tools inside a running container, then rebuilding the image.

Bayesian Networks

Bayesian Networks are probabilistic graphical models that offer a convenient, compact way of representing a joint probability distribution. A Bayesian Network consists of a Directed Acyclic Graph (DAG) that connects random variables (nodes), with each edge indicating a dependence (there is an edge from node A to node B if the variable A helps explain B). Each node is associated with a Conditional Probability Distribution (CPD), conditioned on all the parents of that node. By definition, Bayesian Networks do not contain directed cycles, unlike Markov Random Fields, which are undirected and may contain cycles. Since their edges are directed and acyclic, Bayesian Networks are most often used when one tries to understand causal relationships between the variables.
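
As a small illustration of how the CPDs define the joint distribution, here is a sketch on the classic rain / sprinkler / wet grass network. The joint probability is the product of each node's CPD given its parents. The structure and all the numbers below are made up for the example.

```python
from itertools import product

# Toy network: rain -> sprinkler, rain -> wet, sprinkler -> wet.
# All probabilities are illustrative values, not estimated from data.
p_rain = {True: 0.2, False: 0.8}

# p_sprinkler[rain][sprinkler] = P(sprinkler | rain)
p_sprinkler = {
    True: {True: 0.01, False: 0.99},
    False: {True: 0.4, False: 0.6},
}

# p_wet_true[(sprinkler, rain)] = P(wet = True | sprinkler, rain)
p_wet_true = {
    (False, False): 0.0,
    (False, True): 0.8,
    (True, False): 0.9,
    (True, True): 0.99,
}


def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) as the product of each node's CPD given its parents."""
    p_w = p_wet_true[(sprinkler, rain)]
    return p_rain[rain] * p_sprinkler[rain][sprinkler] * (p_w if wet else 1.0 - p_w)


# The factorized joint is a proper distribution: it sums to 1 over all assignments.
total = sum(joint(r, s, w) for r, s, w in product([True, False], repeat=3))

# Marginals are obtained by summing the joint over the other variables.
p_wet = sum(joint(r, s, True) for r, s in product([True, False], repeat=2))

print(total, p_wet)
```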

Constructing a Bayesian Network involves at least two steps:

  1. Generating the structure of the DAG (i.e., which nodes are connected and in what direction). That is what the DAGs with NO TEARS algorithm does, in an efficient way (code is also available).
  2. Estimating the CPDs. This can be done by maximum likelihood estimation (MLE) or Bayesian estimation.
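
For discrete variables, the second step can be sketched in a few lines: the maximum likelihood estimate of a CPD is the relative frequency of each child value within each configuration of its parents. The helper `estimate_cpd` and the tiny data set below are hypothetical and only illustrate the counting; libraries such as causalnex provide their own fitting routines.

```python
from collections import Counter


def estimate_cpd(samples, child, parents):
    """MLE of P(child | parents) from a list of dict samples, by counting."""
    joint_counts = Counter()
    parent_counts = Counter()
    for sample in samples:
        parent_values = tuple(sample[p] for p in parents)
        joint_counts[(parent_values, sample[child])] += 1
        parent_counts[parent_values] += 1
    # Relative frequency of each child value within each parent configuration.
    return {
        (parent_values, child_value): count / parent_counts[parent_values]
        for (parent_values, child_value), count in joint_counts.items()
    }


# Made-up observations for a two-node network rain -> wet.
samples = [
    {"rain": True, "wet": True},
    {"rain": True, "wet": True},
    {"rain": True, "wet": False},
    {"rain": False, "wet": False},
    {"rain": False, "wet": False},
    {"rain": False, "wet": True},
    {"rain": False, "wet": False},
]

cpd = estimate_cpd(samples, child="wet", parents=["rain"])
for (parent_values, child_value), probability in sorted(cpd.items()):
    print(parent_values, child_value, probability)
```

Combinations that never occur in the data simply get no entry here, which amounts to a probability of zero. Bayesian estimation softens this by adding prior pseudo-counts to the counters before normalizing.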

The website for the QuantumBlack library causalnex contains a brief introduction to Bayesian Networks. A longer, more in-depth explanation can be found in this Stanford class on Probabilistic Graphical Models. An in-between solution might be to look at the slides for these two presentations, 1 and 2. For sequential or temporal models, Dynamic Bayesian Networks were developed. Kevin Murphy has a tutorial on his webpage.