Intro to pytorch
07 Jan 2019
I want to use this post to summarize a few important facts about using pytorch to build and train deep neural nets.
Building a neural net
First, a general note: pytorch uses tensors as its main data structure. For most operations, tensors can be manipulated very much like a numpy array, but tensors can be set to a specific device (e.g., a GPU), and tensors have certain data types that need to be understood. By default, tensors are single precision (i.e., torch.Tensor corresponds to torch.FloatTensor, see here); this is considered fine for training neural networks, but it will cause some issues when trying to check the gradient (see below).
You can find a summary of the basic functionalities of pytorch tensors here.
There is also an interesting tutorial put together by FastAI which builds a simple NN in Python, then gradually adds pytorch capabilities to simplify and clarify the code; this allows you to see what each component of pytorch does.
A few quick notes about pytorch:
- a trailing _ indicates the operation is performed in-place.
You build a neural net by inheriting from the torch class Module (nn.Module).
Forward
At its most basic, you only need to define the method forward(self, x), which defines the forward propagation of your neural net. Typically, all the function(al)s you need to build your deep net are defined in the constructor, i.e., def __init__(self).
To stack the layers of your neural net, you propagate the input variable (typically x) from one layer to the next, and return it at the end.
Torch provides default function(al)s to define each layer (convolutional, rnn, lstm, pooling, activation functions, …). The main reference for all those commands is provided in the pytorch documentation.
Below is an example that defines a small convolutional network (a variation of AlexNet, taken from pytorch's CIFAR10 tutorial):
import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # 2d convolutional layer that takes 3 input images (3 colors),
        # returns 6 images,
        # and applies a convolution kernel of size 5x5
        self.conv1 = nn.Conv2d(3, 6, 5)
        # max pooling layer which only keeps the max value in 2x2 squares (subsampling)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # bunch of linear layers (y = Wx + b)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # apply convolution kernel,
        # apply pointwise ReLU function,
        # and max pooling
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        # change shape to go from multiple small images
        # to one long vector
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
net = Net()
Parameters of the network
The parameters are stored in the iterator net.parameters(). You can convert it to a list and see that the length of the list is equal to the number of layers x 2, since you have parameters for the weights (weight, $W$) and the biases (bias, $b$), and those parameters are stored separately. Typically, the parameters are ordered, for each layer, as weights first, bias second.
To initialize the weights, you can either do it by hand (through the data attribute of the weight or bias component of each layer), or use one of the default initialization functions provided in pytorch (see here).
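For instance, with the Net defined above (5 layers, hence 10 parameter tensors), a quick sketch:
params = list(net.parameters())
print(len(params))        # 10: (weight, bias) for conv1, conv2, fc1, fc2, fc3
print(params[0].shape)    # conv1 weight: torch.Size([6, 3, 5, 5])
print(params[1].shape)    # conv1 bias: torch.Size([6])
# initialize by hand through .data ...
net.conv1.bias.data.fill_(0.0)
# ... or with one of pytorch's default initialization functions
nn.init.xavier_uniform_(net.fc1.weight)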
Loss function
First, note the typical distinction in ML between cost function and loss function. The cost function is the average of the loss function over all training examples (or all training examples in that batch). The loss function is how you measure the performance of your network against the labeled data.
The loss function is defined independently of the neural net. I think the separation is motivated by the fact that the loss function does not have any parameters to be trained. Pytorch comes with a large variety of loss functions, e.g., the cross-entropy,
criterion = nn.CrossEntropyLoss()
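As a quick illustration (a sketch with made-up numbers), the criterion takes raw scores from the network and integer class labels:
import torch
logits = torch.randn(4, 10)            # raw (unnormalized) scores for 4 samples, 10 classes
labels = torch.tensor([3, 0, 7, 1])    # ground-truth class indices
loss = criterion(logits, labels)
print(loss.item())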
An optimizer
Same as for the loss function, you can use one of the many optimizers provided by pytorch, e.g., stochastic gradient descent with momentum,
import torch.optim as optim

optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
net, here, is the convolutional neural network we defined above, so you only need to pass the parameters of the net to the optimizer. Note that these parameters also carry the gradient information (once it has been calculated).
Note that for some optimizers (e.g., L-BFGS), you need to do something more complicated (see here).
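For instance, pytorch's L-BFGS implementation (optim.LBFGS) requires passing a closure to step() that re-evaluates the loss and the gradients; a minimal sketch, assuming data and labels hold one mini-batch:
optimizer = optim.LBFGS(net.parameters())
def closure():
    optimizer.zero_grad()                  # reset the gradients
    loss = criterion(net(data), labels)    # recompute the loss
    loss.backward()                        # recompute the gradients
    return loss
optimizer.step(closure)                    # LBFGS may call the closure several times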
Train the network
Basic procedure
Once we have defined a neural net, a loss function, and an optimizer, we can start training the network. To do so, we need (1) training data and (2) derivatives. You set up the iteration over the training data set yourself, but once you have a mini-batch, you need to (see the full loop sketched after this list):
- compute the loss
outputs = net(data)
loss = criterion(outputs, labels)
- calculate the gradient
loss.backward()
- apply one step of the optimizer:
optimizer.step()
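Putting these together, a minimal training loop could look like the sketch below; trainloader is an assumed torch.utils.data.DataLoader yielding (data, labels) mini-batches. Note the optimizer.zero_grad() call: gradients accumulate by default, so they need to be reset at every iteration.
for epoch in range(2):
    for data, labels in trainloader:
        optimizer.zero_grad()                # reset gradients from the previous step
        outputs = net(data)                  # forward pass
        loss = criterion(outputs, labels)    # loss on this mini-batch
        loss.backward()                      # backpropagate to get the gradients
        optimizer.step()                     # update the parameters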
Looking at the gradient
You can look at the gradient for each layer, either directly
net.conv1.bias.grad
or through the parameters (here for the bias of the first layer),
ii = 0
for x in net.parameters():
    if ii == 1:
        print(x.grad.data)
    ii += 1
Gradient check
As an exercise, I decided to check the gradient of the neural net defined in pytorch's CIFAR10 tutorial, which is some variation of AlexNet. To set the values of a layer, you can do
mynet.layer_nb_x.<bias or weight>.data = torch.tensor(..., device=same_device_as_net)
So far, I actually only checked the gradient with respect to the bias of the first convolutional layer. I compare against a finite-difference approximation (one-sided). The results are in the corresponding jupyter notebook stored on Borgy (/mnt/home/bencrestel/pytorch/tutorials/beginner_source/blitz/cifar10_tutorial.ipynb). To my great surprise, the gradient checks rather well when using double precision, but checks very poorly when using single precision.
I was wondering whether I was doing the right thing, but I found a bug report in the pytorch repo where a user reports trouble with gradient checks. That issue is answered by one of the developers by switching to double precision. Another, more explicit and more convincing argument that single precision should be avoided for gradient checks was found in Stanford's online course on CNNs for visual recognition. In their section on gradient checks, they warn readers to use double precision. So it seems well accepted that in single precision, the gradient will not check. On the bright side, looking at the results in single precision, it seems the problem comes from the finite-difference approximation, not the analytical gradient, which sort of makes sense.
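A minimal version of such a check could look like the sketch below (assuming data and labels hold one mini-batch; eps is an arbitrary perturbation size; only the first entry of conv1.bias is checked):
net_d = Net().double()                      # double-precision copy of the architecture
data_d = data.double()                      # inputs must match the parameter dtype
criterion_d = nn.CrossEntropyLoss()
# analytical gradient
net_d.zero_grad()
criterion_d(net_d(data_d), labels).backward()
grad_analytical = net_d.conv1.bias.grad.clone()
# one-sided finite-difference approximation for the first bias entry
eps = 1e-6
f0 = criterion_d(net_d(data_d), labels).item()
net_d.conv1.bias.data[0] += eps
f1 = criterion_d(net_d(data_d), labels).item()
net_d.conv1.bias.data[0] -= eps             # restore the original value
grad_fd = (f1 - f0) / eps
print(grad_analytical[0].item(), grad_fd)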
no_grad
You sometimes see code where layer or parameter updates are wrapped inside a with torch.no_grad(): statement,
w1 = torch.tensor(...)
with torch.no_grad():
    w1 -= learning_rate * grad_w1
    ...
This is necessary to tell torch not to track the update in the next backpropagation step. However, you do not need to do that if you access the weights of a layer through the data attribute,
net.layer1.weight.data = ...
Indeed, as explained in the comments of that code, tensor.data gives a tensor that shares the storage with tensor, but doesn't track history.
train vs eval
You need to specify in what mode the network is, either train (when training) or eval (when you're done with training and want to use the network). This is used, for instance, by some types of layers like BatchNorm or Dropout, that behave differently in the training and evaluation phases.
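The mode is set by calling the corresponding method on the module:
net.train()    # training mode: Dropout is active, BatchNorm uses batch statistics
# ... training loop ...
net.eval()     # evaluation mode: Dropout is disabled, BatchNorm uses running statistics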
Using CUDA
You need to explicitly transfer your data structures (neural net, …) onto the GPU, using the .to() command. To transfer the neural net, you do
gpu0 = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net.to(gpu0)
You also need to transfer the input data onto the GPU if you want to do that, e.g.,
inputs, labels = inputs.to(gpu0), labels.to(gpu0)
Note that, when applied to tensors, .to(gpu0) creates a copy (unlike when applied to the model). This is described at the end of the CIFAR10 tutorial.
Another good reference for all CUDA matters is the documentation.
A first interesting observation with that CIFAR10 dataset is that, for a given number of epochs, smaller mini-batches (e.g., the default of 4) lead to higher accuracy, but take longer to train than, for instance, mini-batches of 64 images. However, since each epoch is faster with larger mini-batches, a fairer comparison would be at a fixed run time, in which case it seems larger mini-batches win (for that specific application).
Using more than 1 GPU
Now, by default, pytorch will only use one GPU. If you want to use more than one GPU, you need to use nn.DataParallel.
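A minimal sketch (wrapping the model before moving it to the GPU):
if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net)    # splits each mini-batch across the available GPUs
net.to(gpu0)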
[ML, deeplearning, pytorch]