Understanding deep learning
Prince (2024)
There was a lot of content in this book. As of the time of writing, I've created 731 cards in Anki from it. The notes below go into much less detail (covering everything that I've created a card for would require an entire book), but I highly recommend the original book for a broad and up-to-date treatment of the topic.
See Review: Deep learning: foundations and concepts for another book that covers much of the same material. I much preferred this one, but that one provides a valuable alternative perspective and may suit some people better.
Chapter 1: Introduction
This chapter defines supervised learning, unsupervised learning, and reinforcement learning, and gives an overview of what is covered in the rest of the book.
Chapter 2: Supervised learning
This chapter gives more detail on supervised learning, with linear regression used as an example. It includes a brief overview of loss functions and training by gradient descent.
Chapter 3: Shallow neural networks
The book starts getting into neural networks, with some simple examples and visualizations with ReLU activation functions. The focus is on building intuition at this stage.
The universal approximation theorem is discussed, which shows that for any continuous function, there exists a shallow network that can approximate it to any specified precision.
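To make the shape of such a network concrete, here's a minimal NumPy sketch of a shallow ReLU network (my own toy example, with an arbitrary width and random weights). With enough hidden units, a function of this piecewise-linear form can approximate any continuous function to the desired precision.

```python
import numpy as np

def shallow_relu_network(x, W1, b1, W2, b2):
    """One hidden layer of ReLU units followed by a linear output."""
    h = np.maximum(0.0, x @ W1 + b1)   # hidden pre-activations passed through ReLU
    return h @ W2 + b2

# Randomly initialized example: 1 input, 50 hidden units, 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(1, 50)), rng.normal(size=50)
W2, b2 = rng.normal(size=(50, 1)), rng.normal(size=1)

x = np.linspace(-2, 2, 100).reshape(-1, 1)
y = shallow_relu_network(x, W1, b1, W2, b2)  # a piecewise-linear function of x
```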
It concludes by introducing some key pieces of terminology (hidden layers, pre-activations, feed-forward networks, weights, biases, etc.). It claims that any network with at least one hidden layer is known as a multi-layer perceptron, but I would usually interpret that as restricted to fully-connected feed-forward networks.
Chapter 4: Deep neural networks
This chapter shows how to create deep networks by composing layers. The avoidance of vector notation begins to get very messy at this stage and is thankfully abandoned later in the book.
Deep networks are depth efficient in that they often require far fewer parameters than shallow networks in order to approximate the same function. They also allow a more natural modelling of hierarchies.
Moderately deep networks are easier to train than shallow networks, and the trained networks tend to generalize better, though it isn't clear why.
Chapter 5: Loss functions
A wide range of loss functions are discussed for different data types, domains, and use cases, such as Laplace distributions for robust regression, von Mises distributions for predicting directions, and Plackett-Luce distributions for permutations.
Minimizing the cross-entropy loss is shown to be equivalent to minimizing the negative log-likelihood, which in turn is equivalent to minimizing the KL divergence between the empirical distribution and the model distribution.
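As a quick sketch (mine, not from the book): for a one-hot target, the cross-entropy of a softmax output is just the negative log-likelihood of the correct class.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])
probs = softmax(logits)
target = 1                              # index of the correct class

cross_entropy = -np.log(probs[target])  # equals the negative log-likelihood
```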
While the focus is on probabilistic approaches, a brief mention is made of non-probabilistic approaches such as hinge loss and exponential loss.
Chapter 6: Fitting models
This chapter starts by discussing gradient descent and some challenges such as local minima and saddle points (which can make it look as if convergence has been reached).
While you could exhaustively search the space or repeatedly apply gradient descent with different starting points, these approaches aren't practical for models with millions of parameters. Stochastic gradient descent and mini-batch gradient descent can escape local minima (in principle) and are less likely to spend a lot of time at saddle points. They've been found to generalize better in practice than full-batch gradient descent and have the bonus of being less computationally expensive.
As stochastic gradient descent doesn't converge as such, it is often applied with a learning rate schedule, which typically involves making larger jumps in earlier epochs.
Adaptive training algorithms apply different learning rates to different parameters based on statistics accumulated during training. As the estimates early on will be noisy, it can make sense to apply learning rate warm-up, where the learning rate increases initially before decreasing later.
We can add momentum to stochastic gradient descent by combining the gradient for the current batch with the moves based on previous batches. This tends to result in a smoother trajectory with less oscillation. Nesterov accelerated momentum applies the momentum before calculating the gradient rather than after.
RMSProp normalizes the gradient by the pointwise running average of the squared gradient, resulting in larger steps in directions where the gradient is more consistent. This reduces oscillations in directions with variable gradients while encouraging faster convergence in directions with consistent gradients. Adaptive moment estimation (or Adam) combines this idea with momentum, applying it to both the gradient and the squared gradient. Various alternatives and extensions have been proposed (such as AdamW), but Adam tends to work pretty well in practice.
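Here's a minimal sketch of a single Adam step in NumPy (the hyperparameter values are the usual defaults rather than anything specific from the book):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum on the gradient and on the squared gradient.

    t is the 1-based step count, used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad            # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)                  # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```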
Chapter 7: Gradients and initialization
This chapter explains backpropagation, though in practice, libraries tend to handle things automatically with algorithmic differentiation, avoiding the need for hand-coding.
Poorly-initialized parameters can lead to vanishing gradients (if too small) or exploding gradients (if too big). If you are using ReLU activations, you can apply He initialization (which scales the initial weight variance based on the size of the previous layer) to keep the variance of the activations roughly constant from layer to layer.
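A quick sketch of He initialization for a fully-connected layer (the fan-in convention here is the common one, not a detail taken from the book):

```python
import numpy as np

def he_init(fan_in, fan_out, rng=np.random.default_rng()):
    """Draw weights with variance 2 / fan_in, suited to ReLU activations."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = he_init(256, 128)   # weights for a 256 -> 128 layer; biases start at zero
```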
Chapter 8: Measuring performance
Error can be decomposed into noise (due to mislabelling, unobserved variables, or genuine randomness), bias (due to insufficient flexibility in the model), and variance (due to what data points happened to be in the training data).
Increasing model capacity can reduce bias, but tends to increase variance. Variance can be reduced by obtaining more data or introducing regularization.
Neural networks can sometimes display a behaviour called double descent, where the model improves, deteriorates, and then improves again as capacity is added. In the under-parameterized (or classical) regime, we see the expected bias-variance trade-off. Everything appears normal as we reach the stage where the model has sufficient capacity to fit the training data perfectly, but as capacity is increased further (in the over-parameterized, or modern, regime), performance begins to improve again.
This was the most surprising thing in the book to me. It's not clear why it happens, but it could be due to the additional capacity allowing smoother interpolation between the observed data points, though this requires some sort of implicit regularization.
Chapter 9: Regularization
You can apply explicit regularization to the weights. This most commonly takes the form of L2 regularization, which is known in this context as weight decay.
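A tiny sketch (mine) of L2 regularization added to a loss, and the equivalent weight decay step; the value of lam is just illustrative:

```python
import numpy as np

def l2_regularized_loss(loss, weights, lam=1e-4):
    """Add a penalty proportional to the squared norm of the weights."""
    return loss + lam * np.sum(weights ** 2)

def weight_decay_step(weights, grad, lr=0.01, lam=1e-4):
    """Equivalent gradient step: shrink the weights towards zero each update."""
    return weights - lr * (grad + 2 * lam * weights)
```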
Gradient descent tends to avoid areas with steep gradients, resulting in convergence to wider minima. Stochastic gradient descent generalizes better than full-batch gradient descent in practice, and larger step sizes generalize better than smaller ones. This could be due to exploring more of the landscape, but could also be due to implicit regularization.
Early stopping and ensembling can both lead to better generalization.
Dropout randomly zeroes some hidden units (generally up to 50%) during training, reducing the dependence on any particular hidden unit and discouraging them from co-adapting. Dropout is only applied during training, so the final weights must be decreased by multiplying by one minus the dropout probability. This is known as the weight scaling inference rule.
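A sketch of dropout on a layer of activations, with the weight scaling inference rule applied at test time (my own illustration, not code from the book):

```python
import numpy as np

def dropout(h, p, training, rng=np.random.default_rng()):
    """Zero each unit with probability p during training; rescale at test time."""
    if training:
        mask = rng.random(h.shape) >= p      # keep a unit with probability 1 - p
        return h * mask
    return h * (1 - p)                       # weight scaling inference rule

h = np.ones(10)
print(dropout(h, 0.5, training=True))        # roughly half the units zeroed
print(dropout(h, 0.5, training=False))       # all units scaled by 0.5
```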
Monte Carlo dropout involves running the network multiple times with different units clamped to zero and then combining the results. This is similar to ensembling but with a single model.
Dropout can be thought of as applying Bernoulli noise to the activations. Noise can also be applied to the inputs, the weights, or the labels (though you can simply change the loss function using something called label smoothing rather than changing the labels directly). In adversarial training, another model deliberately constructs examples that the model will struggle to deal with.
While you could use Bayesian approaches rather than maximum likelihood, this is difficult in practice and tends to involve various approximations.
Transfer learning involves pre-training the model on some other task with abundant data. The final layers can then be replaced, with the model being fine-tuned for the new task. The book doesn't go into further detail here, but there are many ways to fine-tune a model. See Generative AI with large language models for some possible approaches.
In multi-task learning, the same model is trained for multiple tasks concurrently.
We can apply generative self-supervised learning (where parts of the data are masked and the model attempts to inpaint them) or contrastive self-supervised learning (where the model learns to distinguish related pairs of examples from unrelated pairs).
Data augmentation can be particularly useful for image-related tasks.
Chapter 10: Convolutional networks
Invariance: the output does not change when the input is transformed (e.g., an image classifier should return the same label when the image is translated).
Equivariance (or covariance): the output changes in the same way as the input (e.g., a segmentation map should translate along with the image).
The convolution operation involves sliding a fixed set of weights (the convolution kernel or filter) over the input. The key features are the stride (step size), padding (what to do at the boundary), kernel size, and the channels (number of outputs).
Zero padding assumes that the input is zero outside of the valid range. Valid convolutions only include outputs where the kernel is contained in the input range, but these have the disadvantage of decreasing the number of units relative to the previous layer.
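To make the stride and zero padding concrete, here's a toy 1D convolution in NumPy (a sketch of the operation itself, not of any particular library API):

```python
import numpy as np

def conv1d(x, kernel, stride=1, padding=0):
    """Slide a 1D kernel over a zero-padded input with the given stride."""
    x = np.pad(x, padding)                       # zero padding at the boundaries
    k = len(kernel)
    out_len = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], kernel)
                     for i in range(out_len)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(conv1d(x, np.array([1.0, 0.0, -1.0]), stride=1, padding=1))
```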
Dilated (or atrous) kernels intersperse the weights with zeros based on a dilation rate.
The receptive field of a hidden unit is the region of the original input that feeds into it. This will grow with increasing depth in the network.
The size of subsequent layers can be decreased by increasing the stride or applying max pooling. This is known as downsampling. In general, the number of units is decreased and the number of channels increased as the depth increases.
We can upsample by duplicating values, applying linear interpolation, applying max unpooling, or using transposed convolutions (so called because the matrix is the transpose of the convolution matrix).
1x1 convolutions can be helpful for changing the number of channels.
The chapter concludes with a set of examples for image classification, object detection, and semantic segmentation.
Chapter 11: Residual networks
Deep networks are difficult to train, potentially due to shattered gradients. One way to get around this is to use residual (or skip) connections, which pass the untransformed input around a layer and add it to the output. In addition to helping with gradient flow, this results in an implicit ensemble with multiple potential paths through the network.
Residual connections can result in exploding gradients, as the variance increases with each layer. While it would be possible to apply He initialization and then divide the output of each residual block by √2, the more common solution is to apply batch normalization, which standardizes the activations based on the empirical mean and standard deviation across the batch. This increases the number of parameters in the model, but it allows higher learning rates (as it makes the loss surface more predictable), adds some implicit regularization (due to the randomness inherent in the batch), and effectively makes the network less deep at the start of training (as later layers have a smaller impact on overall variation than earlier ones).
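A rough sketch of a residual block with batch normalization, in NumPy (the layer ordering and shapes here are illustrative choices, not taken from a specific architecture in the book):

```python
import numpy as np

def batch_norm(h, gamma, beta, eps=1e-5):
    """Standardize each channel across the batch, then rescale and shift."""
    mean = h.mean(axis=0)
    std = h.std(axis=0)
    return gamma * (h - mean) / (std + eps) + beta

def residual_block(x, W, gamma, beta):
    """BatchNorm -> ReLU -> linear layer, with the input added back on."""
    h = batch_norm(x, gamma, beta)
    h = np.maximum(0.0, h)
    return x + h @ W                 # skip connection: add the untransformed input

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 64))        # batch of 32 examples, 64 channels
W = rng.normal(size=(64, 64)) * np.sqrt(2.0 / 64)
out = residual_block(x, W, np.ones(64), np.zeros(64))
```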
At the end of the chapter, there are some examples of residual networks such as ResNet and hourglass networks.
Chapter 12: Transformers
Transformers use dot-product self-attention to contextualize input tokens based on other tokens in the sequence.
Queries, keys, and values are computed from the input embeddings. Dot products between queries and keys, passed through a softmax, determine how values are combined to produce outputs.
Multi-head self-attention applies multiple attention mechanisms in parallel. A full transformer layer has multi-head attention followed by a fully-connected network on each word.
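Here's a minimal sketch of single-head scaled dot-product self-attention in NumPy; the projection matrices are random stand-ins rather than trained weights:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Each output is a weighted sum of values, weighted by query-key similarity."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # scaled dot products
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))                      # 6 tokens, 16-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)               # shape (6, 16)
```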
Absolute or relative positional encodings are added to treat different positions differently.
Transformers start with sub-word tokenization (e.g., byte pair encoding) to split text into tokens from a vocabulary.
Masked self-attention, where future tokens are masked, enables autoregressive decoding one token at a time. Encoder-decoder attention connects the decoder to encoder representations.
Quadratic complexity of full self-attention limits sequence lengths. Modifications like pruning, sparsity, and global tokens help scale to longer sequences.
Top-k and nucleus sampling (top-p sampling) improve text generation quality.
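A small sketch of nucleus (top-p) sampling over a vocabulary distribution; top-k works the same way but keeps a fixed number of tokens instead:

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=np.random.default_rng()):
    """Sample from the smallest set of tokens whose probability mass reaches p."""
    order = np.argsort(probs)[::-1]                   # tokens, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1       # size of the nucleus
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()      # renormalize over the nucleus
    return rng.choice(kept, p=kept_probs)

probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
token = nucleus_sample(probs, p=0.9)                  # the two rarest tokens are excluded
```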
Most attention heads can be pruned after training.
Self-attention can be viewed as a hypernetwork, in which one part of the network computes the weights applied by another part.
Chapter 13: Graph neural networks
Graph convolutional networks (GCNs) update nodes by aggregating neighbours' information, inducing a relational inductive bias. They are spatial-based, contrasting with spectral-based methods.
Simple GCN layers sum neighbour embeddings at each node. Variations include diagonal enhancement, averaging instead of summing neighbours (useful if the embedding information is more relevant than the structural information), and Kipf normalization to down-weight high-degree nodes.
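A sketch of the simple sum-over-neighbours variant: each node aggregates its neighbours' embeddings via the adjacency matrix (diagonal enhancement would add self-loops so that a node's own embedding is included):

```python
import numpy as np

def gcn_layer(A, H, W):
    """Sum neighbour embeddings (via adjacency matrix A), transform, apply ReLU."""
    return np.maximum(0.0, A @ H @ W)

# Tiny 3-node graph: edges 0-1 and 1-2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))      # node embeddings
W = rng.normal(size=(4, 4))      # shared weight matrix
H_next = gcn_layer(A, H, W)
```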
Transductive models consider labelled and unlabelled data together (semi-supervised learning), in contrast to inductive models, which learn from labelled data and generalize to new data.
The receptive field size in GNNs is known as the k-hop neighborhood. This is often too large (potentially the entire graph), a phenomenon known as the graph expansion problem. Neighbourhood sampling and graph partitioning address graph expansion by sampling neighbourhoods or clustering the graph.
Graph attention layers compute edge weights based on node data.
Deep GNNs can suffer from suspended animation (gradients not propagating) and over-smoothing (loss of local information, which can mean that increasing depth is less effective than for non-graph networks).
Chapter 14: Unsupervised learning
There are a few extra things here, but this chapter is mostly a high-level comparison of ideas and techniques that I covered in my notes on Generative deep learning.
Chapter 15: Generative adversarial networks
In progressive growing, we initially train the GAN on low-resolution images before adding additional upsampling layers. The higher-resolution layers gradually "fade in" over time.
Mini-batch discrimination gives the discriminator access to statistics for the mini-batch. This encourages the generator to include sufficient variation in its output, preventing mode collapse.
Auxiliary classifier GANs and InfoGAN offer alternatives to the conditional GANs discussed in the other book.
Chapter 16: Normalizing flows
Chapter 17: Variational autoencoders
Chapter 18: Diffusion models
Chapter 19: Reinforcement learning
A policy determines the agent's action for each state. Stationary policies depend only on the current state, while non-stationary policies also depend on the time step.
The return is the sum of cumulative discounted future rewards and measures the future benefit of being on a trajectory (an observed sequence of states, rewards, and actions). A rollout is a simulated trajectory. An episode is a trajectory from an initial to a terminal state.
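A quick sketch (mine) of computing the discounted return at each step of a trajectory; the discount factor is just an illustrative value:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Return at step t: r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))   # [0.81, 0.9, 1.0]
```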
The temporal credit assignment problem involves associating rewards with decisive actions that may occur many steps before the reward.
Model-based methods find the best policy from the transition matrix and reward structure, using dynamic programming when these are known or estimating them from observed MDP trajectories.
Model-free methods are divided into value estimation approaches (estimate optimal state-action value function) and policy estimation approaches (directly estimate optimal policy using gradient descent).
Temporal difference methods update the policy while the agent traverses the MDP, making each state's value consistent with the successor state's value using the Bellman equation (bootstrapping).
The Policy Improvement Theorem states that updating a policy to be greedy with respect to its value function creates a strictly better (or equal) policy.
On-policy methods learn from the current policy being followed by the agent, while off-policy methods learn from a policy different from the one being followed.
Experience replay stores past experiences (transitions) in a buffer and samples from it to break correlations in the observation sequence and smooth over changes in the data distribution.
Q-learning is an off-policy TD control algorithm that learns the optimal state-action value function Q(s, a) directly.
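A sketch of the tabular Q-learning update; the environment interaction loop is omitted and the hyperparameters are arbitrary:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Move Q(s, a) towards the reward plus the discounted best next-state value."""
    td_target = r + gamma * Q[s_next].max()     # bootstrapped estimate of the return
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = np.zeros((5, 2))                             # 5 states, 2 actions
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=3)
```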
Offline RL (also known as batch RL) learns from a fixed dataset of previously collected experiences without further interaction with the environment.
Chapter 20: Why does deep learning work?
Saving the most important/interesting stuff for last, though a lot was touched upon in earlier chapters too.
The lottery ticket hypothesis (Frankle & Carbin, 2019) suggests that overparameterized networks contain small trainable sub-networks (winning tickets) that are sufficient to provide the performance. This is demonstrated by pruning and retraining from the same initial weights.
The loss function along a straight line between initial parameters and final values usually decreases monotonically, with occasional small bumps near the start (Goodfellow et al., 2015b).
Real optimization trajectories lie in low-dimensional subspaces, even though they don't proceed in a straight line (Li et al., 2018b).
Good minima are not generally linearly connected, as evidenced by the pronounced increase in loss along a straight line between two independently found minima (Goodfellow et al., 2015b).
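A sketch of how you might probe this kind of finding yourself: evaluate the loss at evenly spaced points along the straight line between two parameter vectors (initial and final, or two independently found minima). The quadratic loss here is just a stand-in:

```python
import numpy as np

def loss_along_line(loss_fn, theta_a, theta_b, steps=50):
    """Evaluate the loss at evenly spaced points between two parameter vectors."""
    alphas = np.linspace(0.0, 1.0, steps)
    return [loss_fn((1 - alpha) * theta_a + alpha * theta_b)
            for alpha in alphas]

# Toy example with a quadratic "loss".
losses = loss_along_line(lambda t: np.sum(t ** 2),
                         theta_a=np.array([3.0, -2.0]),
                         theta_b=np.array([0.0, 0.0]))
```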
Baldi & Hornik (1989) found that the error surface of a shallow linear network has no bad local minima, only saddle points.
The Goldilocks zone is a range of L2 norm of weights where the curvature of the loss surface is unusually positive (Fort & Scherlis, 2019). He and Xavier initialization fall within this range.
Grokking (Power et al., 2022) is a phenomenon where a model trained on a small dataset suddenly transitions from underfitting to near-perfect generalization, often after many epochs of training. This suggests that the model has "grokked" the underlying concept.
Chapter 21: Deep learning and ethics
This is an important topic, but there wasn't much non-obvious material to note here.
One thing that I found useful was the distinction between the inner alignment problem (getting the model to behave in line with a well-specified loss function) and the outer alignment problem (the gap between the loss function and our actual objectives).