14 Jan 2019
GANs are a popular type of generative model that attempts to approximate a
probability distribution. They do so by setting up a game between 2 agents: one
tasked with generating samples (the generator), and one tasked with deciding
whether the generated samples are real or fake (the discriminator).
The typical setup uses deep nets for both the generator and the discriminator,
each with its own cost function. They are trained by alternating optimization
steps on each.
One criticism is that the distribution learned by GANs has very low support,
and therefore only covers a low-dimensional subset of the target distribution. This is
discussed in Do GANs actually learn the distribution? An empirical
study.
There exist different architectures for GANs. In Are GANs Created Equal? A
Large-Scale
Study,
the authors compare different GANs and find no consistent, meaningful differences
among them. They also propose some ways to compare GANs; another paper in that
direction is Assessing Generative Models via Precision and
Recall.
This
article
discusses a few workarounds to some of the biggest pitfalls of GANs, namely:
- mode collapse
- non-convergence (oscillation)
- vanishing gradients (when the discriminator is over-confident)
Unfortunately, I find most of the solutions to be ad hoc, like duct tape, not
really addressing the underlying problems.
This article still has some merit, at least as a nice, concise summary of the
current state of research (e.g., it summarizes all the cost functions used in GANs).
14 Jan 2019
Regularization
In Regularization for Deep Learning: A
Taxonomy, the authors list and classify a
large number of regularization techniques for DNNs (as of October 2017).
One way Deep Learning avoids over-fitting is by building in sparsity.
This can be achieved with $L_1$ regularization, or with post hoc
pruning, i.e., removing some neurons from the network after training is
complete. This pruning can be completely random, but most likely will be
targeted given some sort of metric (magnitude of weights, gradient,…). In
Targeted Dropout, the authors
observe that dropout also promotes sparsity (in the sense of a small number of high
activations; see here),
and they propose to drop out, with a
higher probability, the neurons that would be pruned in the post-training stage.
Dropout typically doesn't work as well for CNNs. In DropBlock: A regularization
method for convolutional networks,
the authors postulate that this is due to the spatial correlation between
neurons in a CNN, and propose to drop units in a spatially correlated manner
(DropBlock). They report better results.
A similar issue applies to LSTM-based networks, for which (traditional) dropout doesn't work. In
Recurrent neural network regularization,
the authors introduce a modified way of applying dropout to networks with LSTM
cells; the key is to apply dropout only to non-recurrent connections, i.e.,
connections between different layers of LSTM cells.
Optimization
In Adaptive Methods for Nonconvex
Optimization,
the authors study convergence properties of scaled gradient-based methods, and
highlight the benefit of gradually increasing the mini-batch size during
training.
Network understanding
In Implicit Self-Regularization in Deep Neural
Networks, the authors try to understand
why DNNs work so well and do not overfit by applying random matrix theory to the
eigenstructure of the last 2 layers of a wide range of popular (fully connected)
networks. Their findings include:
- DNNs are self-regularizing;
- they self-regularize in different ways depending on whether it's an older network
(Tikhonov-like) or a more modern architecture (heavy-tailed
self-regularization);
- batch size is connected to the self-regularization properties (small batch sizes
are better self-regularizing).
09 Jan 2019
Market impact is often defined by relating the price difference to the volume
(or POV) traded, i.e.,
\(S_T - S_0 = k v_T + \varepsilon\)
That model assumes normal price dynamics, which may or may not make sense
depending on your time scale, but it could easily be modified to assume
log-normal price dynamics by using the difference of the logs of the prices
instead.
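For instance, the log-normal variant of the same linear model would read
\(\log(S_T) - \log(S_0) = k v_T + \varepsilon\)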
However, another very common approach is to assume a power law for the market
impact, i.e., something like
\(S_T - S_0 = k v_T^\alpha + \varepsilon\)
Now comes the question of fitting that model. And this is what this post is
about.
There are 2 approaches to go about that:
- you could fit directly using nonlinear regression techniques.
- you could fit instead the log of that expression
\(\log(S_T - S_0) = \log(k) + \alpha \log(v_T) + \varepsilon\)
However, those two expressions are not equivalent, primarily because of their
assumptions on the noise distribution. This is explained in
the introduction of that
paper
(note that the conclusions of that paper are heavily criticized by that other
paper).
You can estimate the change in the noise variance when applying method 2 by
using the Delta method. This is also discussed in that
stackexchange
question.
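As a rough sketch of that Delta-method argument: if \(Y = S_T - S_0\) has mean \(\mu\) and variance \(\sigma^2\), then to first order
\(\mathrm{Var}(\log Y) \approx \sigma^2 / \mu^2\)
i.e., the log-transform rescales the noise variance by the squared expected impact.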
In the most relevant answer, the suggestion is to first fit using the
log-transform, then use the coefficients obtained that way as the starting point
for a nonlinear regression solved with Newton's method.
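As a rough sketch of that two-step approach (on synthetic data, and using scipy's curve_fit, which defaults to Levenberg-Marquardt, rather than a hand-rolled Newton's method):

import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
v = rng.uniform(0.01, 0.3, size=500)            # traded volume (or POV)
dS = 0.5 * v**0.6 + rng.normal(0, 0.01, 500)    # synthetic impact with k=0.5, alpha=0.6

# step 1: fit the log-transformed model (only valid where the impact is positive)
mask = dS > 0
slope, intercept = np.polyfit(np.log(v[mask]), np.log(dS[mask]), 1)
alpha0, k0 = slope, np.exp(intercept)

# step 2: nonlinear least squares on the original model,
# initialized with the coefficients from the log-fit
def power_law(v, k, alpha):
    return k * v**alpha

(k_hat, alpha_hat), _ = curve_fit(power_law, v, dS, p0=[k0, alpha0])
print(k_hat, alpha_hat)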
07 Jan 2019
I want to use this post to summarize a few important facts about using pytorch
to build and train deep neural nets.
Building a neural net
First, a general note: pytorch uses
tensors as its main data structure.
For most operations, tensors can be manipulated very much like numpy arrays. But
tensors can be placed on a specific device (e.g., a GPU), and tensors have specific
data types that need to be understood. By default, tensors are single precision
(i.e., torch.Tensor corresponds to torch.FloatTensor, see here);
this is considered fine for training neural networks, but it will create some
issues when trying to check the gradient (see below).
You can find a summary of pytorch tensors basic functionalities
here.
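A few (illustrative) examples of the basics:

import torch

x = torch.ones(3, 2)             # single precision (torch.float32) by default
print(x.dtype)                   # torch.float32
y = x.double()                   # returns a double-precision (torch.float64) copy
z = torch.zeros(3, 2, dtype=torch.float64)
gpu0 = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
x_gpu = x.to(gpu0)               # copy the tensor onto the GPU (if available)
print(x + 1)                     # most numpy-style operations work as expected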
There is also an interesting tutorial put together by FastAI which builds a
simple NN in Python, then gradually adds pytorch capabilities to simplify and
clarify the code; this lets you see what each component of pytorch does.
A few quick notes about pytorch:
- a trailing _ indicates the operation is performed in-place.
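For example:

import torch

x = torch.ones(3)
y = x.add(1)    # out-of-place: returns a new tensor, x is unchanged
x.add_(1)       # in-place: modifies x directly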
You build a neural net by subclassing the torch class Module.
Forward
At its most basic, you only need to define the method forward(self, x), which
defines the forward propagation of your neural net. Typically, all function(al)s
you need to build your deep net are defined in the constructor, i.e.,
def __init__(self).
To stack the layers of your neural net, you propagate the input variable
(typically x) from one layer to the next, and return it at the end.
Torch provides default function(al)s to define each layer (convolutional, rnn,
lstm, pooling, activation functions,…).
The main reference for all those commands is provided in the pytorch
documentation.
Below is an example that defines a small AlexNet-style convolutional network (the one used in pytorch's CIFAR10 tutorial):
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # 2d convolutional layer that takes 3 input images (3 colors),
        # returns 6 feature maps,
        # and applies a convolution kernel of size 5x5
        self.conv1 = nn.Conv2d(3, 6, 5)
        # max pooling layer which only keeps the max value in 2x2 squares (subsampling)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # bunch of linear layers (y = Wx + b)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # apply convolution kernel,
        # apply pointwise ReLU function,
        # and max pooling
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        # change shape to go from multiple small feature maps
        # to one long vector
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()
Parameters of the network
The parameters are stored in the iterator net.parameters(). You can convert
that to a list and see that its length is equal to the number of layers x 2,
since the weights (weight, $W$) and the biases (bias, $b$) are stored as
separate parameters. Typically, the parameters are ordered, for each layer, as
weights first, bias second.
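For the network defined above, a quick illustrative check:

params = list(net.parameters())
print(len(params))       # 10: 5 layers x 2 (weight and bias)
print(params[0].shape)   # conv1 weight: torch.Size([6, 3, 5, 5])
print(params[1].shape)   # conv1 bias:   torch.Size([6])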
To initialize weights, you can either do it by hand (through the data attribute
of the weight or bias components of each layer), or use one of the default
functions provided in pytorch (see
here).
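For example, a couple of ways to initialize the network defined above (the choices of layers and distributions here are arbitrary):

import torch.nn as nn

# by hand, through the data attribute
net.fc1.bias.data.fill_(0.0)
# or with one of pytorch's built-in initialization functions
nn.init.xavier_uniform_(net.conv1.weight)
nn.init.normal_(net.fc1.weight, mean=0.0, std=0.01)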
Loss function
First, note the typical distinction in ML between cost function and loss
function. The cost function is the average of the loss function over all
training examples (or all training examples in that batch). The loss function is
how you measure the performance of your network against the labeled data.
The loss function is defined independently from the neural net.
I think the separation is motivated by the fact that the loss function does not
have any parameters to be trained.
Pytorch comes with a large variety of loss functions, e.g., the cross-entropy,
criterion = nn.CrossEntropyLoss()
An optimizer
Same as for the loss function, you can use one of the many optimizers provided
by pytorch, e.g., stochastic gradient descent with momentum,
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
net, here, is the convolutional neural network we defined above. So you only
need to pass the parameters of the net to the optimizer. But these parameters
also contain the gradient information (once calculated).
Note that for some optimizers (e.g., LBFGS), you need to do something more
complicated (see here).
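For instance, pytorch's LBFGS expects a closure that re-evaluates the loss; a rough sketch (assuming data and labels come from the training loop below):

import torch.optim as optim

optimizer = optim.LBFGS(net.parameters(), lr=0.1)

def closure():
    optimizer.zero_grad()               # reset accumulated gradients
    outputs = net(data)                 # forward pass
    loss = criterion(outputs, labels)   # evaluate the loss
    loss.backward()                     # compute gradients
    return loss

optimizer.step(closure)                 # LBFGS may call the closure several times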
Train the network
Basic procedure
Once we have defined a neural net, a loss function, and an optimizer, we can
start training the network. To do so, we need (1) training data and (2)
derivatives. You set up an iteration over the training data set, and once you
have a mini-batch, you need to
- compute the loss
outputs = net(data)
loss = criterion(outputs, labels)
- calculate the gradient
loss.backward()
- apply one step of the optimizer
optimizer.step()
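Putting it together, a minimal sketch of the training loop (assuming trainloader is a DataLoader over the training set; gradients accumulate by default, so they need to be reset at each iteration):

for epoch in range(2):
    for data, labels in trainloader:
        optimizer.zero_grad()              # reset accumulated gradients
        outputs = net(data)                # forward pass
        loss = criterion(outputs, labels)  # compute the loss
        loss.backward()                    # backpropagation: compute gradients
        optimizer.step()                   # update the parameters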
Looking at the gradient
You can look at the gradient for each layer, either directly
or through the parameters (here for the bias of the first layer),
for ii, x in enumerate(net.parameters()):
    if ii == 1:
        print(x.grad.data)
Gradient check
As an exercise, I decided to check the gradient of the neural net defined in the
pytorch’s CIFAR10
tutorial,
which is some variation of AlexNet.
To set the values of a layer, you can do
mynet.layer_nb_x.<bias or weight>.data = torch.tensor(....,device=same_device_as_net)
So far, I actually only checked the gradient wrt the bias of the first
convolutional layer. I compare against a finite-difference approximation
(one-directional). The results are in the corresponding jupyter notebook stored
on Borgy
(/mnt/home/bencrestel/pytorch/tutorials/beginner_source/blitz/cifar10_tutorial.ipynb).
To my great surprise, the gradient checks rather well when using double precision, but
checks very poorly when using single precision.
I was wondering whether I was
doing the right thing, but I found a bug
report in the pytorch repo
where a user reports trouble with gradient checking. That issue is answered by
one of the developers, who suggests switching to double precision.
Another, more explicit and more convincing indication that single precision should
be avoided for gradient checks comes from Stanford's online course on CNNs for
visual recognition. In the notes on gradient
checks, they warn
readers to use double precision.
So it seems well accepted that the gradient will not check in single precision.
On the bright side, looking at the results in single precision, it seems the
problem comes from the finite-difference approximation, not the analytical gradient,
which sort of makes sense.
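For reference, a minimal sketch of the kind of check I ran (against the first entry of the bias of the first convolutional layer), assuming net, criterion, inputs, and labels are defined as above, and working on a double-precision copy:

import copy

import torch

net_fd = copy.deepcopy(net).double()     # double-precision copy of the network
inputs_d = inputs.double()

# analytical gradient
net_fd.zero_grad()
loss = criterion(net_fd(inputs_d), labels)
loss.backward()
grad_analytical = net_fd.conv1.bias.grad[0].item()

# one-directional finite-difference approximation on the first bias entry
h = 1e-6
with torch.no_grad():
    net_fd.conv1.bias[0] += h
loss_h = criterion(net_fd(inputs_d), labels)
grad_fd = (loss_h.item() - loss.item()) / h

print(grad_analytical, grad_fd)          # should agree to several digits in double precision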
no_grad
You sometimes see code where some operations, often layer or parameter updates,
are wrapped inside a with statement,
w1 = torch.tensor(...)
with torch.no_grad():
    w1 -= learning_rate * grad_w1
    ...
This is necessary to tell torch not to track the parameter update in the next
backpropagation step. However, you do not need to do that if you access the
weights of a layer through the data attribute,
net.layer1.weight.data = ...
Indeed, as explained in the comments of that code, tensor.data gives a tensor
that shares the same storage as tensor, but doesn't track history.
train vs eval
You need to specify what mode the network is in: either train (when training)
or eval (when you're done with training and want to use the network). This is
used, for instance, by some types of layers like BatchNorm or Dropout, which
behave differently in the training and evaluation phases.
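For instance (test_inputs being a hypothetical batch of test data):

net.train()    # training mode: Dropout active, BatchNorm uses batch statistics
# ... training loop ...
net.eval()     # evaluation mode: Dropout disabled, BatchNorm uses running statistics
with torch.no_grad():
    predictions = net(test_inputs)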
Using CUDA
You need to explicitly transfer your data structures (neural net,…) onto the
GPU, using the .to() command. To transfer the neural net, you do
gpu0 = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net.to(gpu0)
You also need to transfer the input data onto the GPU, e.g.,
inputs, labels = inputs.to(gpu0), labels.to(gpu0)
Note that when applied to tensors, .to(gpu0) creates a copy (unlike when applied
to the model).
This is described at the end of the CIFAR10
tutorial.
Another good reference for all CUDA matters is the
documentation.
A first interesting observation with that CIFAR10 dataset is that, for a given
number of epochs, smaller mini-batches (e.g., the default of 4) lead to higher
accuracy, but take longer to train than, for instance, mini-batches of
64 pics. However, since each epoch is faster with larger mini-batches, a fairer
comparison would be for a fixed run time, in which case it seems larger
mini-batches win (for that specific application).
Using more than 1 GPU
Now by default, pytorch will only use 1 GPU. If you want to use more than 1 GPU,
you need to use
DataParallel.
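A rough sketch, following the usual pattern (net and gpu0 as defined above):

import torch
import torch.nn as nn

if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net)   # replicate the model and split each batch across GPUs
net.to(gpu0)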