Intro to pytorch

I want to use this post to summarize a few important facts about using pytorch to build and train deep neural nets.

Building a neural net

First, a general note: pytorch uses tensors as its main data structure. For most operations, tensors can be manipulated very much like numpy arrays. But tensors can be placed on a specific device (e.g., a GPU), and they carry a data type that you need to be aware of. By default, tensors are single precision (i.e., torch.Tensor corresponds to torch.FloatTensor–see here); this is considered fine for training neural networks, but it creates some issues when trying to check the gradient (see below). You can find a summary of the basic tensor functionalities here. There is also an interesting tutorial put together by FastAI which builds a simple NN in plain Python, then gradually adds pytorch capabilities to simplify and clarify the code; this lets you see what each component of pytorch does.
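For instance, here is a quick sketch of these basics (the shapes and device name are just illustrative):

import torch

# tensors default to single precision (float32)
x = torch.ones(3, 4)
print(x.dtype)          # torch.float32

# you can ask for double precision explicitly
xd = torch.ones(3, 4, dtype=torch.float64)

# numpy-like manipulations
y = 2 * x + 1
print(y.sum(), y.shape)

# move a tensor to the GPU if one is available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
x_gpu = x.to(device)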

A few quick notes about pytorch:

You build a neural net by inheriting from the torch class nn.Module.

Forward

At its most basic, you only need to define the method forward(self, x), which defines the forward propagation of your neural net. Typically, all function(al)s you need to build your deep net are defined in the constructor, i.e., def __init__(self). To stack the layers of your neural net, you propagate the input variable (typically x) from one layer to the next, and return it at the end. Torch provides default function(al)s to define each layer (convolutional, rnn, lstm, pooling, activation functions,…). The main reference for all those commands is the pytorch documentation.

Below is an example that defines the small convolutional net from pytorch's CIFAR10 tutorial,

import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # 2d convolutional layer that takes 3 input channels (the 3 colors),
        # returns 6 output channels,
        # and applies a convolution kernel of size 5x5
        self.conv1 = nn.Conv2d(3, 6, 5)
        # max pooling layer which only keeps the max value in 2x2 squares (subsampling)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # bunch of linear layers (y = Wx + b)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # apply convolution kernel,
        # apply pointwise ReLU function
        # and max pooling
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        # change shape to go from multiple small images,
        # to one long vector
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()

Parameters of the network

The parameters are stored in the iterator net.parameters(). You can convert it to a list and see that its length is equal to the number of layers x 2, since each layer has parameters for the weights (weight, $W$) and the biases (bias, $b$), and those are stored separately. Typically, the parameters are ordered, for each layer, with the weights first and the bias second. To initialize the weights, you can either do it by hand (through the .data attribute of the weight or bias of each layer), or use one of the default initialization functions provided in pytorch (see here).
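As a quick illustration with the net defined above (the choice of initializers here is just an example):

import torch.nn as nn

params = list(net.parameters())
print(len(params))            # 10 = 5 layers x 2 (weight + bias)
print(params[0].shape)        # conv1.weight: torch.Size([6, 3, 5, 5])

# initialize by hand through .data
net.conv1.bias.data.fill_(0.0)

# or use one of the functions in torch.nn.init
nn.init.xavier_uniform_(net.fc1.weight)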

Loss function

First, note the typical distinction in ML between the loss function and the cost function. The loss function measures the performance of your network against the labeled data for a single example; the cost function is the average of the loss function over all training examples (or over all examples in a mini-batch).

The loss function is defined independently from the neural net. I think the separation is motivated by the fact that the loss function does not have any parameters to be trained. Pytorch comes with a large variety of loss functions, e.g., the cross-entropy,

criterion = nn.CrossEntropyLoss()
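As a quick sketch of how it is used (the random inputs and labels here are only illustrative, sized for the net above):

import torch

outputs = net(torch.randn(4, 3, 32, 32))   # a mini-batch of 4 CIFAR10-sized images
labels = torch.tensor([1, 0, 4, 9])        # ground-truth class indices
loss = criterion(outputs, labels)
print(loss.item())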

An optimizer

Same as for the loss function, you can use one of the many optimizers provided by pytorch, e.g., stochastic gradient descent with momentum,

import torch.optim as optim

optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

net, here, is the convolutional neural network we defined above, so you only need to pass the parameters of the net to the optimizer. Note that these parameters also carry the gradient information (in their .grad attribute, once it has been computed).

Note that for some optimizers (e.g., LBFGS), you need to do something more complicated: the step function requires a closure that re-evaluates the model and returns the loss (see here).
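For reference, a minimal sketch of what that looks like with torch.optim.LBFGS (reusing net, criterion, and a mini-batch inputs/labels; the learning rate is arbitrary):

optimizer = optim.LBFGS(net.parameters(), lr=0.1)

def closure():
    # the closure must zero the gradients, recompute the loss,
    # backpropagate, and return the loss
    optimizer.zero_grad()
    loss = criterion(net(inputs), labels)
    loss.backward()
    return loss

optimizer.step(closure)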

Train the network

Basic procedure

Once we have defined a neural net, a loss function, and an optimizer, we can start training the network. To do so, we need (1) training data and (2) derivatives. You set up an iteration over the training data set. Then, for each mini-batch, you need to zero out the gradients, run the forward pass, compute the loss, backpropagate with loss.backward(), and update the parameters with optimizer.step().
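Here is a minimal sketch of that loop, assuming trainloader is a DataLoader that iterates over the training set in mini-batches:

for epoch in range(2):
    for inputs, labels in trainloader:
        optimizer.zero_grad()              # reset the gradients accumulated so far
        outputs = net(inputs)              # forward pass
        loss = criterion(outputs, labels)  # cost over the mini-batch
        loss.backward()                    # backpropagation: fills the .grad attributes
        optimizer.step()                   # one gradient descent step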

Looking at the gradient

You can look at the gradient for each layer, either directly

net.conv1.bias.grad

or through the parameters (here for the bias of the first layer),

for ii, p in enumerate(net.parameters()):
    if ii == 1:  # parameter 1 is conv1.bias (parameter 0 is conv1.weight)
        print(p.grad.data)

Gradient check

As an exercise, I decided to check the gradient of the neural net defined in pytorch's CIFAR10 tutorial (the small convolutional net shown above). To set the values of a layer, you can do

mynet.layer_nb_x.<bias or weight>.data = torch.tensor(...., device=same_device_as_net)

So far, I actually only checked the gradient with respect to the bias of the first convolutional layer. I compare against a (one-sided) finite-difference approximation. The results are in the corresponding jupyter notebook stored on Borgy (/mnt/home/bencrestel/pytorch/tutorials/beginner_source/blitz/cifar10_tutorial.ipynb). To my great surprise, the gradient checks rather well when using double precision, but checks very poorly when using single precision.

I was wondering whether I was doing the right thing, but I found a bug report in the pytorch repo where a user reports trouble with the gradient check. One of the developers answers that issue by switching to double precision. Another, more explicit and more convincing indication that single precision should be avoided for gradient checks comes from Stanford's online course on CNNs for visual recognition: in the section on gradient checks, they warn the reader to use double precision. So it seems well accepted that in single precision the gradient will not check. On the bright side, looking at the results in single precision, it seems the problem comes from the finite-difference approximation, not the analytical gradient, which sort of makes sense.
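For reference, here is a rough sketch of that kind of check for the bias of the first convolutional layer (one-sided finite differences; the step size eps is only illustrative, everything is cast to double precision, and net, criterion, and a mini-batch inputs/labels are reused from above):

import copy
import torch

net = net.double()              # run the check in double precision
inputs = inputs.double()

# analytical gradient from backpropagation
net.zero_grad()
loss = criterion(net(inputs), labels)
loss.backward()
grad_analytical = net.conv1.bias.grad.clone()

# one-sided finite-difference approximation, one bias entry at a time
eps = 1e-6
grad_fd = torch.zeros_like(net.conv1.bias.data)
for i in range(net.conv1.bias.numel()):
    net_p = copy.deepcopy(net)
    net_p.conv1.bias.data[i] += eps
    loss_p = criterion(net_p(inputs), labels)
    grad_fd[i] = (loss_p - loss).item() / eps

# largest absolute discrepancy between the two
print((grad_analytical - grad_fd).abs().max())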

no_grad

You sometimes see code where some operations, often parameter or value updates, are wrapped inside a with statement,

w1 = torch.tensor(...)
with torch.no_grad():
    w1 -= learning_rate * grad_w1
    ...

This is necessary to tell torch not to track the parameter update in the next backpropagation step. However, you do not need to do that if you access the weights of a layer through the .data attribute,

net.layer1.weight.data = ...

Indeed, as explained in the comments of that code, tensor.data gives a tensor that shares the storage with tensor, but doesn’t track history.

train vs eval

You need to specify in what mode the network is, either train (when training) or eval (when you're done with training and want to use the network). This matters, for instance, for some types of layers, such as BatchNorm or Dropout, which behave differently during the training and evaluation phases.
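In practice this is just a one-line switch:

net.train()     # training mode (Dropout active, BatchNorm uses batch statistics)
# ... training loop ...
net.eval()      # evaluation mode (Dropout off, BatchNorm uses running statistics)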

Using CUDA

You need to explicitly transfer your data structures (neural net,…) onto the GPU, using the .to() command. To transfer the neural net, you do

gpu0 = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net.to(gpu0)

You also need to transfer the input data onto the GPU if you want to do that, e.g.,

inputs, labels = inputs.to(gpu0), labels.to(gpu0)

Note that, when applied to tensors, .to(gpu0) creates a copy (unlike when applied to the model, where it works in place). This is described at the end of the CIFAR10 tutorial.

Another good reference for all CUDA matters is the documentation.

A first interesting observation with the CIFAR10 dataset is that, for a given number of epochs, smaller mini-batches (e.g., the default of 4) lead to higher accuracy, but take longer to train than, for instance, mini-batches of 64 pictures. However, since each epoch is faster with larger mini-batches, a fairer comparison is at fixed run time, in which case it seems larger mini-batches win (for this specific application).

Using more than 1 GPU

By default, pytorch will only use 1 GPU. If you want to use more than 1 GPU, you need to wrap the model in DataParallel, as sketched below.
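A minimal sketch (the wrapper splits each mini-batch across all visible GPUs; gpu0 is the device defined earlier):

if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net)
net.to(gpu0)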

[ ML  deeplearning  pytorch  ]