Literature on Deep Learning

Regularization

In Regularization for Deep Learning: A Taxonomy, the authors list and classify a large number of regularization techniques for DNNs (as of October 2017).

One way Deep Learning avoids over-fitting is by building in sparsity. Sparsity can be achieved with $L_1$ regularization, or with post hoc pruning, i.e., removing some neurons from the network after training is complete. This pruning can be completely random, but is usually targeted according to some metric (magnitude of weights, gradient, …). In Targeted Dropout, the authors observe that dropout also promotes sparsity (in the sense of a small number of high activations; see here), and they propose to drop out, with a higher probability, the neurons that would be pruned in the post-training stage.
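
A minimal numpy sketch of the targeted-dropout idea, assuming magnitude-based pruning as the targeting criterion; the function name and the `gamma`/`alpha` values are illustrative placeholders, not taken from the paper:

```python
import numpy as np

def targeted_dropout(weights, gamma=0.5, alpha=0.66, rng=None):
    """Drop low-magnitude weights with higher probability.

    gamma: fraction of weights treated as pruning candidates (lowest magnitude).
    alpha: dropout probability applied to those candidates.
    (Values here are placeholders, not the paper's settings.)
    """
    rng = np.random.default_rng() if rng is None else rng
    w = weights.ravel().copy()
    k = int(gamma * w.size)                     # number of pruning candidates
    candidates = np.argsort(np.abs(w))[:k]      # smallest-magnitude weights
    drop = candidates[rng.random(k) < alpha]    # drop each candidate w.p. alpha
    w[drop] = 0.0
    return w.reshape(weights.shape)

# Example: mask a random weight matrix as one would during a training step.
W = np.random.randn(128, 64)
W_masked = targeted_dropout(W)
```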

Dropout typically doesn’t work as well for CNNs. In DropBlock: A regularization method for convolutional networks, the authors postulate that this is due to the spatial correlation between neurons in a CNN, and propose to drop units in a spatially correlated manner (DropBlock). They report better results.
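
A rough sketch of DropBlock on a single 2D feature map; the block-seed rate below is a simplification of the paper's formula for the seed probability:

```python
import numpy as np

def dropblock(feature_map, block_size=3, drop_prob=0.1, rng=None):
    """Zero out contiguous block_size x block_size regions of an (H, W) map.

    The seed rate below is a simplification; the paper derives it so that the
    expected fraction of dropped units matches drop_prob exactly.
    """
    rng = np.random.default_rng() if rng is None else rng
    H, W = feature_map.shape
    gamma = drop_prob / (block_size ** 2)        # simplified block-seed rate
    seeds = rng.random((H, W)) < gamma           # centers of blocks to drop
    mask = np.ones((H, W))
    half = block_size // 2
    for i, j in zip(*np.nonzero(seeds)):
        mask[max(0, i - half): i + half + 1,
             max(0, j - half): j + half + 1] = 0.0   # drop the whole block
    kept = mask.mean()
    return feature_map * mask / max(kept, 1e-8)  # rescale to preserve expectation
```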

A similar issue applies to LSTM-based networks, for which (traditional) dropout doesn’t work. In Recurrent neural network regularization, the authors introduce a modified way of applying dropout to networks with LSTM cells; the key is to apply dropout only to non-recurrent connections, i.e., connections between different layers of LSTM cells.
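
A sketch of that scheme, assuming a stack of generic recurrent cells (LSTM internals omitted): dropout is applied to the input each layer receives from the layer below, but never to the hidden state carried across time steps:

```python
import numpy as np

def dropout(x, p, rng):
    """Inverted dropout: zero entries with probability p, rescale the rest."""
    return x * (rng.random(x.shape) >= p) / (1.0 - p)

def stacked_rnn_forward(x_seq, cells, p_drop=0.5, rng=None):
    """Forward pass through a stack of recurrent cells.

    `cells` is a list of callables h_t = cell(x_t, h_prev); LSTM internals are
    omitted, and the hidden size is assumed equal to the input size for brevity.
    Dropout is applied only to the non-recurrent (layer-to-layer) connections;
    the hidden state carried across time steps is never dropped.
    """
    rng = np.random.default_rng() if rng is None else rng
    layer_input = x_seq                          # shape (T, d)
    for cell in cells:
        h = np.zeros_like(layer_input[0])
        outputs = []
        for x_t in layer_input:
            x_t = dropout(x_t, p_drop, rng)      # non-recurrent connection: dropped
            h = cell(x_t, h)                     # recurrent connection: intact
            outputs.append(h)
        layer_input = np.stack(outputs)
    return layer_input
```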

Optimization

In Adaptive Methods for Nonconvex Optimization, the authors study convergence properties of scaled gradient-based methods, and highlight the benefit of gradually increasing the mini-batch size during training.
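
As an illustration of the batch-size idea (the schedule below is an assumption, not the authors' recipe), one can grow the mini-batch size on a fixed schedule instead of decaying the learning rate:

```python
import numpy as np

def batch_size_schedule(epoch, base_batch=128, growth=2, every=30, max_batch=8192):
    """Double the mini-batch size every `every` epochs, up to max_batch.
    (The numbers are placeholders, not a recipe from the paper.)"""
    return min(base_batch * growth ** (epoch // every), max_batch)

def iterate_minibatches(data, epoch, rng=None):
    """Yield shuffled mini-batches whose size grows as training progresses."""
    rng = np.random.default_rng() if rng is None else rng
    bs = batch_size_schedule(epoch)
    idx = rng.permutation(len(data))
    for start in range(0, len(data), bs):
        yield data[idx[start:start + bs]]
```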

Network understanding

In Implicit Self-Regularization in Deep Neural Networks, the authors try to understand why DNNs work so well and do not overfit by applying random matrix theory to the eigenstructure of the last two (fully connected) layers of a wide range of popular networks. Their findings include:

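For context, the basic object in that kind of analysis is the eigenvalue spectrum of a layer's correlation matrix $X = W^T W / N$; the helper below is a minimal numpy sketch of computing it (an illustration, not the authors' code):

```python
import numpy as np

def empirical_spectral_density(W):
    """Eigenvalues of the correlation matrix X = W^T W / N for an N x M
    weight matrix W; their histogram is the empirical spectral density."""
    N, _ = W.shape
    X = W.T @ W / N
    return np.linalg.eigvalsh(X)

# A random (untrained) layer gives a spectrum close to the Marchenko-Pastur
# bulk; for a trained network one would load its weight matrices instead.
W = np.random.randn(1024, 512) / np.sqrt(1024)
print(empirical_spectral_density(W).max())
```
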
[ ML  deeplearning  ]