Fast AI's Pytorch tutorial

Fast AI wrote a really nice pytorch tutorial which starts from a very manual implementation of a one-hidden-layer neural network trained on MNIST data, then gradually adds on pytorch's built-in capabilities. It nicely shows off what pytorch can do, and how it simplifies the coding. The different steps involve using the following (a small sketch combining several of these pieces follows the list):

  • torch.nn.functional: provides built-in functions for the most common activation and loss functions.
  • nn.Module: provides a base class for defining a neural network; you inherit from it when defining your own NN.
  • nn.Linear: a predefined linear layer (convolutional layers, etc. are also available).
  • optim: provides a broad selection of optimizers that can be used to train your neural network. You still need to compute the gradient yourself, and zero out the gradients after the update step. The parameters of the NN are passed to the optimizer when it is instantiated, which, in addition to telling the optimizer what parameters to optimize, also provides the optimizer with gradient information.
  • Dataset: provides a convenient way to manipulate datasets; it allows you to slice the inputs and the labels together.
  • DataLoader: manages mini-batches automatically; you just give it a Dataset and a batch size.
  • conv2d: pytorch has a lot of layers already built in that can be used directly when building an object of type nn.Module.
  • nn.Sequential: another way of defining a model, instead of subclassing nn.Module yourself.
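
To make these pieces concrete, here is a minimal sketch (not taken from the tutorial itself) combining nn.Sequential, optim, TensorDataset and DataLoader; the tensors x_train and y_train are hypothetical placeholders for the MNIST data.

import torch
from torch import nn, optim
from torch.utils.data import TensorDataset, DataLoader

# hypothetical MNIST-like data: 28x28 images flattened to 784 features, 10 classes
x_train = torch.randn(1000, 784)
y_train = torch.randint(0, 10, (1000,))

# one-hidden-layer network defined with nn.Sequential
model = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
loss_func = nn.CrossEntropyLoss()
opt = optim.SGD(model.parameters(), lr=0.1)

# Dataset + DataLoader handle slicing and mini-batching
train_dl = DataLoader(TensorDataset(x_train, y_train), batch_size=64)

for xb, yb in train_dl:
    loss = loss_func(model(xb), yb)
    loss.backward()   # compute the gradients
    opt.step()        # update the parameters
    opt.zero_grad()   # zero out the gradients for the next batch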

Literature on GANs

GANs are a popular type of generative model that attempts to approximate a probability distribution. It does so by setting up a game between 2 agents: one tasked with generating samples (the generator), and one tasked with deciding whether the samples are real or fake (the discriminator). The typical setup uses deep nets for both the generator and the discriminator, each with its own cost function. They are trained by alternating optimization steps on each.
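
To illustrate the alternating optimization, here is a minimal, schematic pytorch sketch of one training iteration; the networks G and D, the noise dimension, and the batch of real samples are all hypothetical placeholders, and the losses follow the standard binary cross-entropy formulation rather than any specific paper.

import torch
from torch import nn, optim

# hypothetical generator and discriminator
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_G = optim.Adam(G.parameters(), lr=1e-4)
opt_D = optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCELoss()

real = torch.randn(32, 2)     # placeholder for a batch of real samples
noise = torch.randn(32, 16)
fake = G(noise)

# discriminator step: classify real samples as 1, generated samples as 0
opt_D.zero_grad()
loss_D = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
loss_D.backward()
opt_D.step()

# generator step: try to make the discriminator output 1 on generated samples
opt_G.zero_grad()
loss_G = bce(D(fake), torch.ones(32, 1))
loss_G.backward()
opt_G.step()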

One criticism is that the distribution learned by a GAN has very low support, and therefore only captures a low-dimensional subset of the target distribution. This is discussed in Do GANs actually learn the distribution? An empirical study.

There exist different architectures for GANs. In Are GANs Created Equal? A Large-Scale Study, the authors compare different GANs and find no consistent, meaningful difference among them. They also propose some ways to compare GANs; another paper in that direction is Assessing Generative Models via Precision and Recall.

This article discusses a few workarounds to some of the biggest pitfalls of GANs, namely:

  • mode collapse
  • non-convergence (oscillation)
  • vanishing gradients (when the discriminator is over-confident)

Unfortunately, I find most of the solutions to be ad hoc, like duct tape, not really trying to address the underlying problems. The article still has some merit, at least as a nice, concise summary of the current state of research (e.g., it summarizes all the cost functions used in GANs).

Literature on Deep Learning

Regularization

In Regularization for Deep Learning: A Taxonomy, the authors list and classify a large number of regularization techniques for DNNs (as of October 2017).

One way Deep Learning avoids over-fitting is by building in sparsity. Sparsity can be achieved with $L_1$ regularization, or with post hoc pruning, i.e., removing some neurons from the network after training is complete. This pruning can be completely random, but most likely it will be targeted based on some metric (magnitude of the weights, gradient,…). In Targeted Dropout, the authors observe that dropout also promotes sparsity (in the sense of a small number of high activations; see here), and they propose to drop out, with a higher probability, the neurons that would be pruned in the post-training stage. A small sketch of the $L_1$-penalty approach is given below.
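
For concreteness, here is a minimal sketch of adding an $L_1$ penalty to the training loss in pytorch; the model, the batch, and the coefficient l1_coeff are hypothetical placeholders chosen purely for illustration.

import torch
from torch import nn

# hypothetical model and mini-batch
model = nn.Sequential(nn.Linear(20, 50), nn.ReLU(), nn.Linear(50, 2))
x, y = torch.randn(32, 20), torch.randint(0, 2, (32,))
loss_fn = nn.CrossEntropyLoss()
l1_coeff = 1e-4  # hypothetical regularization strength

# data-fit term plus an L1 penalty on all the parameters (promotes sparsity)
loss = loss_fn(model(x), y)
loss = loss + l1_coeff * sum(p.abs().sum() for p in model.parameters())
loss.backward()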

Dropout typically doesn’t work as well for CNNs. In DropBlock: A regularization method for convolutional networks, the authors postulate that this is due to the spatial correlation between neurons in a CNN, and propose to drop units in a spatially correlated manner (DropBlock). They report better results.

A similar issue applies to LSTM-based networks, for which (traditional) dropout doesn’t work. In Recurrent neural network regularization, the authors introduce a modified way of applying dropout to networks with LSTM cells; the key is to apply dropout only to the non-recurrent connections, i.e., connections between different layers of LSTM cells.

Optimization

In Adaptive Methods for Nonconvex Optimization, the authors study convergence properties of scaled gradient-based methods, and highlight the benefit of gradually increasing the mini-batch size during training.

Network understanding

In Implicit Self-Regularization in Deep Neural Networks, the authors try to understand why DNNs work so well and do not overfit by applying random matrix theory to the eigenstructure of the last 2 layers of a wide range of popular (fully connected) networks. Their findings include:

  • DNNs are self-regularizing;
  • they self-regularize in different ways depending on whether it’s an older network (Tikhonov-like regularization) or a more modern architecture (heavy-tailed self-regularization);
  • batch size is connected to the self-regularization properties (small batch sizes are better at self-regularizing).

Fitting power models

Market impact is often defined as relating the price difference to the volume (or POV) traded, i.e., \(S_T - S_0 = k v_T + \varepsilon\). That model assumes normal price dynamics, which may or may not make sense depending on your time scale, but it could easily be modified to assume log-normal price dynamics by using the difference of the logs of the prices instead.

However, another very common approach is to assume a power law for the market impact, i.e., something like \(S_T - S_0 = k v_T^\alpha + \varepsilon\). Now comes the question of fitting that model, and this is what this post is about.

There are 2 approaches to go about that:

  1. you could fit it directly using nonlinear regression techniques.
  2. you could instead fit the log of that expression, \(\log(S_T - S_0) = \log(k) + \alpha \log(v_T) + \varepsilon\). However, those two formulations are not equivalent, primarily because of their assumptions on the noise distribution. This is explained in the introduction of that paper (note that the conclusions of that paper are heavily criticized by that other paper).

You can estimate the change in the noise variance when applying method 2 by using the Delta method. This is also discussed in that stackexchange question. In the most relevant answer, the suggestion is to first fit using the log-transform, then use the coefficients obtained that way as the starting point of a nonlinear regression solved with Newton’s method. A small sketch of that two-step approach is given below.
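
As an illustration, here is a minimal sketch of that two-step approach on synthetic data; the data, the true exponent, and the use of scipy's curve_fit (instead of a hand-written Newton solver) are all assumptions made purely for illustration.

import numpy as np
from scipy.optimize import curve_fit

# hypothetical synthetic data: price moves dS and traded volumes v from a power law
rng = np.random.default_rng(0)
v = rng.uniform(1e3, 1e6, size=500)
dS = 0.05 * v**0.6 * np.exp(rng.normal(0, 0.1, size=500))

# method 2: linear regression on the logs, log(dS) = log(k) + alpha * log(v)
A = np.vstack([np.ones_like(v), np.log(v)]).T
log_k, alpha = np.linalg.lstsq(A, np.log(dS), rcond=None)[0]

# method 1: nonlinear least squares, warm-started with the log-fit estimates
popt, _ = curve_fit(lambda v, k, a: k * v**a, v, dS, p0=[np.exp(log_k), alpha])
print("log-fit:", np.exp(log_k), alpha, "  nonlinear fit:", popt)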

Intro to pytorch

I want to use this post to summarize a few important facts about using pytorch to build and train deep neural nets.

Building a neural net

First, a general note: pytorch uses tensors as its main data structure. For most operations, tensors can be manipulated very much like a numpy array. But tensors can be assigned to a specific device (e.g., a GPU), and tensors have certain data types that need to be understood. By default, tensors are single precision (i.e., torch.Tensor corresponds to torch.FloatTensor–see here); this is considered fine for training neural networks, but it will create some issues when trying to check the gradient (see below). You can find a summary of the basic functionalities of pytorch tensors here. There is also an interesting tutorial put together by FastAI which builds a simple NN in Python, then gradually adds on pytorch capabilities to simplify and clarify the code; this lets you see what each component of pytorch does.

A few quick notes about pytorch (illustrated in the short sketch below):

  • a trailing _ indicates the operation is performed in-place.
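
Here is a minimal sketch of those tensor basics (the values are arbitrary):

import torch

x = torch.ones(2, 3)           # default dtype is torch.float32 (single precision)
print(x.dtype)                 # torch.float32
xd = x.double()                # convert to double precision (torch.float64)

x.add_(1.0)                    # trailing underscore: in-place addition
x.mul_(2.0)                    # in-place multiplication

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
x_gpu = x.to(device)           # for tensors, .to() returns a copy on the device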

You build a neural net by inheriting from the torch class nn.Module.

Forward

At its most basic, you only need to define the method forward(self, x), which defines the forward propagation of your neural net. Typically, all the function(al)s you need to build your deep net are defined in the constructor, i.e., def __init__(self). To stack the layers of your neural net, you propagate the input variable (typically x) from one layer to the next, and return it at the end. Torch provides default function(al)s to define each layer (convolutional, rnn, lstm, pooling, activation functions,…). The main reference for all those commands is the pytorch documentation.

Below is an example that defines a small convolutional network (the one used in pytorch’s CIFAR10 tutorial),

import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # 2d convolutional layer that takes a 3-channel input image,
        # returns 6 feature maps,
        # and applies a 5x5 convolution kernel
        self.conv1 = nn.Conv2d(3, 6, 5)
        # max pooling layer which only keeps the max value in each 2x2 square (subsampling)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # a bunch of linear layers (y = Wx + b)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # apply the convolution kernel,
        # the pointwise ReLU activation,
        # and max pooling
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        # reshape: from a stack of small feature maps to one long vector
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()

Parameters of the network

The parameters are stored in the iterator net.parameters(). You can convert it to a list and see that its length is equal to the number of layers x 2, since you have parameters for the weights (weight, $W$) and the biases (bias, $b$), and those are stored separately. Typically, the parameters are ordered, for each layer, as weights first, bias second. To initialize the weights, you can either do it by hand (through the data attribute of the weight or bias components of each layer), or use one of the default functions provided in pytorch (see here). A short sketch is given below.
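
For example, a minimal sketch of inspecting and initializing the parameters of the net defined above (the chosen initializers are arbitrary):

params = list(net.parameters())
print(len(params))        # 10: 5 layers (conv1, conv2, fc1, fc2, fc3) x 2 (weight, bias)
print(params[0].shape)    # conv1.weight, of shape (6, 3, 5, 5)

# initialize by hand, through the data attribute
net.conv1.bias.data.fill_(0.0)

# or use one of pytorch's default initializers
nn.init.xavier_uniform_(net.fc1.weight)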

Loss function

First, note the typical distinction in ML between the cost function and the loss function: the loss function measures the performance of your network against the labeled data (for a single example), and the cost function is the average of the loss over all training examples (or over the examples in the current mini-batch).

The loss function is defined independently from the neural net. I think the separation is motivated by the fact that the loss function does not have any parameters to be trained. Pytorch comes with a large variety of loss functions, e.g., the cross-entropy,

criterion = nn.CrossEntropyLoss()

An optimizer

Same as for the loss function, you can use one of the many optimizers provided by pytorch, e.g., stochastic gradient descent with momentum,

optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

net, here, is the convolutional neural network we defined above, so you only need to pass the parameters of the net to the optimizer. These parameters also carry the gradient information (once it has been computed).

Note that for some optimizers (e.g., L-BFGS), you need to do something more complicated: step() must be given a closure that re-evaluates the loss (see here, and the sketch below).
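
For reference, here is a minimal sketch of that pattern with pytorch's LBFGS optimizer, assuming net, criterion, and a mini-batch inputs/labels as in the snippets above:

optimizer = optim.LBFGS(net.parameters(), lr=0.1)

def closure():
    # LBFGS may evaluate the loss several times per step, hence the closure
    optimizer.zero_grad()
    loss = criterion(net(inputs), labels)
    loss.backward()
    return loss

optimizer.step(closure)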

Train the network

Basic procedure

Once we have defined a neural net, a loss function, and an optimizer, we can start training the network. To do so, we need (1) training data and (2) derivatives. You set up an iteration over the training data set; then, for each mini-batch, you need to do the following (you should also zero out the gradients between mini-batches; a complete loop is sketched after the list):

  • compute the loss
    outputs = net(data)
    loss = criterion(outputs, labels)
    
  • calculate the gradient
    loss.backward()
    
  • apply one step of the optimizer:
    optimizer.step()
    

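Putting it together, here is a minimal sketch of the full training loop for the net, criterion, and optimizer defined above; trainloader is a hypothetical DataLoader over the training set, and the number of epochs is arbitrary:

for epoch in range(2):
    for data, labels in trainloader:
        optimizer.zero_grad()              # zero out the gradients from the previous step
        outputs = net(data)                # forward pass
        loss = criterion(outputs, labels)  # compute the loss
        loss.backward()                    # backpropagation: compute the gradients
        optimizer.step()                   # update the parameters
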
Looking at the gradient

You can look at the gradient for each layer, either directly

net.conv1.bias.grad

or through the parameters (here for the bias of the first layer),

for ii, p in enumerate(net.parameters()):
    if ii == 1:  # index 0 is conv1.weight, index 1 is conv1.bias
        print(p.grad.data)

Gradient check

As an exercise, I decided to check the gradient of the neural net defined in pytorch’s CIFAR10 tutorial (the small convolutional network above). To set the values of a layer, you can do

mynet.layer_nb_x.<bias or weight>.data = torch.tensor(...., device=same_device_as_net)

So far, I have actually only checked the gradient wrt the bias of the first convolutional layer. I compare against a (one-sided) finite-difference approximation. The results are in the corresponding jupyter notebook stored on Borgy (/mnt/home/bencrestel/pytorch/tutorials/beginner_source/blitz/cifar10_tutorial.ipynb). To my great surprise, the gradient checks rather well when using double precision, but checks very poorly when using single precision.

I was wondering whether I was doing the right thing, but I found a bug report in the pytorch repo where a user reports trouble with a gradient check. That issue is answered by one of the developers by switching to double precision. Another, more explicit and more convincing indication that single precision should be avoided for gradient checks is in Stanford’s online course on CNNs for visual recognition: in the section on gradient checks, they warn the reader to use double precision. So it seems well accepted that the gradient will not check in single precision. On the bright side, looking at the results in single precision, it seems the problem comes from the finite-difference approximation, not the analytical gradient, which sort of makes sense.
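
For reference, here is a minimal sketch of that kind of one-sided finite-difference check on the bias of the first convolutional layer, assuming net, criterion, and a mini-batch inputs/labels as above; the step size h is arbitrary:

import torch

# run the check in double precision
net = net.double()
inputs = inputs.double()

# analytical gradient
net.zero_grad()
criterion(net(inputs), labels).backward()
analytical = net.conv1.bias.grad.clone()

# one-sided finite differences, one bias entry at a time
h = 1e-6
f0 = criterion(net(inputs), labels).item()
fd = torch.zeros_like(analytical)
for i in range(net.conv1.bias.numel()):
    net.conv1.bias.data[i] += h
    fd[i] = (criterion(net(inputs), labels).item() - f0) / h
    net.conv1.bias.data[i] -= h

print((fd - analytical).abs().max())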

no_grad

You sometimes see code where some operations, often weight updates, are wrapped inside a with statement,

w1 = torch.tensor(...)
with torch.no_grad():
    w1 -= learning_rate * grad_w1
    ...

This is necessary to tell torch not to track the weight update in the autograd graph (so it does not interfere with the next backpropagation step). However, you do not need to do that if you access the weights of a layer through the data attribute,

net.layer1.weight.data = ...

Indeed, as explained in the comments of that code, tensor.data gives a tensor that shares the storage with tensor, but doesn’t track history.

train vs eval

You need to specify what mode the network is in: either train (when training) or eval (when you’re done training and want to use the network). This is used, for instance, by some types of layers like BatchNorm or Dropout, which behave differently in the training and evaluation phases (see the short sketch below).
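
A minimal sketch of switching between the two modes (testloader is a hypothetical DataLoader over the test set):

net.train()     # training mode: Dropout/BatchNorm use their training behavior
# ... training loop ...

net.eval()      # evaluation mode
with torch.no_grad():           # no need to track gradients at evaluation time
    for data, labels in testloader:
        predictions = net(data).argmax(dim=1)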

Using CUDA

You need to explicitly transfer your data structures (neural net,…) onto the GPU, using the .to() command. To transfer the neural net, you do

gpu0 = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net.to(gpu0)

You also need to transfer the input data onto the GPU if you want to do that, e.g.,

inputs, labels = inputs.to(gpu0), labels.to(gpu0)

Note that, when applied to tensors, .to(gpu0) creates a copy (unlike when applied to the model, which is moved in place). This is described at the end of the CIFAR10 tutorial.

Another good reference for all CUDA matters is the documentation.

A first interesting observation with that CIFAR10 dataset is that, for a given number of epochs, smaller mini-batches (e.g., the default of 4) lead to higher accuracy, but take longer to train than, for instance, mini-batches of 64 images. However, since each epoch is faster with larger mini-batches, a fairer comparison would be at a fixed run time, in which case it seems larger mini-batches win (for that specific application).

Using more than 1 GPU

Now by default, pytorch will only use 1 GPU. If you want to use more than 1 GPU, you need to use DataParallel.
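
A minimal sketch, assuming the net and the device gpu0 from above:

net = nn.DataParallel(net)   # replicate the model over all visible GPUs and split each batch among them
net.to(gpu0)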