# Understanding deep learning

There was a lot of content in this book. As of the time of writing, I've created 731 cards in Anki from it. The notes below go into much less detail (covering everything that I've created a card for would require an entire book), but I highly recommend the original book for a broad and up-to-date treatment of the topic.

See [Review: Deep learning: foundations and concepts](/technical-books/review-deep-learning-foundations-and-concepts.md) for another book that covers much of the same material. I much preferred this one, but that one provides a valuable alternative perspective and may suit some people better.

## Chapter 1: Introduction

This chapter defines supervised learning, unsupervised learning, and reinforcement learning, and gives an overview of what is covered in the rest of the book.

## Chapter 2: Supervised learning

This chapter gives more detail on supervised learning, with linear regression used as an example. It includes a brief overview of loss functions and training by gradient descent.
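As a rough illustration of fitting a linear regression model by gradient descent, here is a minimal sketch in plain Python. The function name `fit_line`, the toy data, the learning rate, and the step count are all illustrative assumptions rather than anything taken from the book.

```python
def fit_line(xs, ys, learning_rate=0.1, steps=500):
    """Fit y = slope * x + intercept by gradient descent on the mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current model on every training example.
        residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to each parameter.
        grad_slope = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_intercept = 2.0 / n * sum(residuals)
        # Step downhill.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Toy data generated from y = 2x + 1 (an illustrative choice).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
```

On this noise-free data the fitted parameters should end up very close to a slope of 2 and an intercept of 1.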

## Chapter 3: Shallow neural networks

The book starts getting into neural networks, with some simple examples and visualizations with ReLU activation functions. The focus is on building intuition at this stage.
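To make the piecewise-linear behaviour concrete, here is a small sketch of a shallow network with one scalar input and ReLU hidden units. The function names and the choice of target function are my own illustrative assumptions.

```python
def relu(z):
    return max(0.0, z)


def shallow_network(x, hidden, output_weights, output_bias):
    """hidden is a list of (weight, bias) pairs, one per hidden unit."""
    activations = [relu(w * x + b) for w, b in hidden]
    return output_bias + sum(v * h for v, h in zip(output_weights, activations))


# Two hidden units are enough to represent |x| exactly: relu(x) + relu(-x).
hidden = [(1.0, 0.0), (-1.0, 0.0)]
output_weights = [1.0, 1.0]
values = [shallow_network(x, hidden, output_weights, 0.0) for x in (-2.0, 0.0, 3.0)]
```

Each hidden unit contributes one "joint" to the output, so adding more units allows more linear regions.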

The **universal approximation theorem** is discussed, which shows that for any continuous function, there exists a shallow network that can approximate it to any specified precision.

It concludes by introducing some key pieces of terminology (hidden layers, pre-activations, feed-forward networks, weights, biases, etc.). It claims that any network with at least one hidden layer is known as a multi-layer perceptron, but I would usually interpret that as restricted to fully-connected feed-forward networks.

## Chapter 4: Deep neural networks

This chapter shows how to create deep networks by composing layers. The avoidance of vector notation begins to get very messy at this stage and is thankfully abandoned later in the book.

Deep networks are **depth efficient** in that they often require far fewer parameters than shallow networks in order to approximate the same function. They also allow a more natural modelling of hierarchies.

Moderately deep networks are easier to train than shallow networks, and the trained networks tend to generalize better, though it isn't clear why.

## Chapter 5: Loss functions

A wide range of loss functions are discussed for different data types, domains, and use cases, such as Laplace distributions for robust regression, von Mises distributions for predicting directions, and Plackett-Luce distributions for permutations.

Minimizing the **cross-entropy loss** is shown to be equivalent to minimizing the negative log-likelihood, and both are equivalent to minimizing the KL divergence between the empirical distribution and the model distribution.
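The relationship between negative log-likelihood, cross-entropy, and KL divergence can be checked numerically. The sketch below uses made-up labels and a hypothetical model distribution. It shows that the mean negative log-likelihood equals the cross-entropy between the empirical and model distributions, which in turn differs from the KL divergence only by the entropy of the empirical distribution (a constant that does not depend on the model).

```python
import math


def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Observed class labels and the resulting empirical distribution.
labels = [0, 0, 1, 2, 0, 1]
num_classes = 3
empirical = [labels.count(k) / len(labels) for k in range(num_classes)]

# A hypothetical model distribution over the three classes.
model = [0.5, 0.3, 0.2]

mean_nll = -sum(math.log(model[y]) for y in labels) / len(labels)
entropy = cross_entropy(empirical, empirical)  # does not depend on the model
```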

While the focus is on probabilistic approaches, a brief mention is made of non-probabilistic approaches such as hinge loss and exponential loss.

## Chapter 6: Fitting models

This chapter starts by discussing **gradient descent** and some challenges such as **local minima** and **saddle points** (which can make it look as if convergence has been reached).

While you could exhaustively search the space or repeatedly apply gradient descent with different starting points, these approaches aren't practical for models with millions of parameters. **Stochastic** gradient descent and **mini-batch** gradient descent can escape local minima (in principle) and are less likely to spend a lot of time at saddle points. They've been found to generalize better in practice than **full-batch** gradient descent and have the bonus of being less computationally expensive.

As stochastic gradient descent doesn't converge as such, it is often applied with a **learning rate schedule**, which typically involves making larger jumps in earlier **epochs**.

**Adaptive** training algorithms apply different learning rates to different parameters based on statistics accumulated during training. As the estimates early on will be noisy, it can make sense to apply **learning rate warm-up**, where the learning rate increases initially before decreasing later.

We can add **momentum** to stochastic gradient descent by combining the gradient for the current batch with the moves based on previous batches. This tends to result in a smoother trajectory with less oscillation. **Nesterov accelerated momentum** applies the momentum before calculating the gradient rather than after.
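One common way to write the momentum update is sketched below. The function name, the hyperparameter values, and the toy loss are assumptions for illustration, and other formulations scale the gradient term differently.

```python
def momentum_step(params, grads, velocity, learning_rate=0.1, beta=0.9):
    """One update: the move blends the previous move with the current gradient."""
    new_velocity = [beta * v + (1 - beta) * g for v, g in zip(velocity, grads)]
    new_params = [p - learning_rate * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Minimize the toy loss p**2 starting from p = 5.
params, velocity = [5.0], [0.0]
for _ in range(300):
    grads = [2.0 * p for p in params]  # gradient of p**2
    params, velocity = momentum_step(params, grads, velocity)
```

Nesterov momentum would instead evaluate the gradient at the point reached after applying the momentum term.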

**RMSProp** normalizes the gradient by the pointwise running average of the squared gradient, resulting in larger steps in directions where the gradient is more consistent. This reduces oscillations in directions with variable gradients while encouraging faster convergence in directions with consistent gradients. **Adaptive moment estimation** (or **Adam**) combines this idea with momentum, applying it to both the gradient and the squared gradient. Various alternatives and extensions have been proposed (such as AdamW), but Adam tends to work pretty well in practice.
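A minimal sketch of a single Adam update follows, assuming the usual default hyperparameters and the standard bias-correction terms. The function name and the example gradients are illustrative.

```python
import math


def adam_step(params, grads, m, v, t, learning_rate=0.01,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    # Running averages of the gradient (momentum) and of the squared gradient.
    new_m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grads)]
    new_v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grads)]
    # Correct the bias caused by initializing both averages at zero.
    m_hat = [mi / (1 - beta1 ** t) for mi in new_m]
    v_hat = [vi / (1 - beta2 ** t) for vi in new_v]
    new_params = [p - learning_rate * mh / (math.sqrt(vh) + eps)
                  for p, mh, vh in zip(params, m_hat, v_hat)]
    return new_params, new_m, new_v


# Two parameters whose gradients differ by four orders of magnitude.
params, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
params, m, v = adam_step(params, [100.0, 0.01], m, v, t=1)
```

After the first step both parameters have moved by roughly the learning rate, regardless of the size of their gradients. This is the normalization effect described above.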

## Chapter 7: Gradients and initialization

This chapter explains backpropagation, though in practice, libraries tend to handle things automatically with algorithmic differentiation, avoiding the need for hand-coding.

Poorly-initialized parameters can lead to **vanishing gradients** (if too small) or **exploding gradients** (if too big). If you are using ReLU activations then you can apply **He initialization** (which scales based on the size of the previous layer) to ensure that the values are appropriate.
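The effect of He initialization can be sketched with a small simulation. The code below assumes zero biases, Gaussian weights with variance 2 divided by the number of inputs, and an arbitrary width and depth. Under those assumptions, the mean squared activation should stay at roughly the same scale from layer to layer rather than vanishing or exploding.

```python
import math
import random


def he_std(fan_in):
    """Standard deviation for He initialization with ReLU activations."""
    return math.sqrt(2.0 / fan_in)


def relu_layer(inputs, width, rng):
    """A fully-connected ReLU layer with freshly sampled He-initialized weights."""
    std = he_std(len(inputs))
    outputs = []
    for _ in range(width):
        pre_activation = sum(rng.gauss(0.0, std) * h for h in inputs)
        outputs.append(max(0.0, pre_activation))
    return outputs


def mean_square(values):
    return sum(v * v for v in values) / len(values)


rng = random.Random(0)
width = 200
activations = [rng.gauss(0.0, 1.0) for _ in range(width)]
input_scale = mean_square(activations)
for _ in range(5):
    activations = relu_layer(activations, width, rng)
output_scale = mean_square(activations)
```

Replacing `he_std` with a much smaller or larger constant makes `output_scale` shrink or grow geometrically with depth.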

## Chapter 8: Measuring performance

Error can be decomposed into **noise** (due to mislabelling, unobserved variables, or genuine randomness), **bias** (due to insufficient flexibility in the model), and **variance** (due to what data points happened to be in the training data).

Increasing model **capacity** can reduce bias, but tends to increase variance. Variance can be reduced by obtaining more data or introducing regularization.

Neural networks can sometimes display a behaviour called **double descent**, where the model improves, deteriorates, and then improves again as capacity is added. In the **under-parameterized** (or **classical**) regime, we see the expected bias-variance trade-off. Everything appears normal as we reach the stage where the model has sufficient capacity to fit the training data perfectly, but as capacity is increased further (in the **over-parameterized**, or **modern**, regime), performance begins to improve again.

This was the most surprising thing in the book to me. It's not clear why it happens, but it could be due to the additional capacity allowing smoother interpolation between the observed data points, though this requires some sort of implicit regularization.

## Chapter 9: Regularization

You can apply explicit regularization to the weights. This most commonly takes the form of **L2 regularization**, which is known in this context as **weight decay**.

Gradient descent tends to avoid areas with steep gradients, resulting in convergence to wider minima. Stochastic gradient descent generalizes better than full-batch gradient descent in practice, and larger step sizes generalize better than smaller ones. This could be due to exploring more of the landscape, but could also be due to implicit regularization.

**Early stopping** and **ensembling** can both lead to better generalization.

**Dropout** randomly zeroes some hidden units (generally up to 50%) during training, reducing the dependence on any particular hidden unit and discouraging them from co-adapting. Dropout is only applied during training, so more units are active at inference time. To compensate, the final weights are scaled down by multiplying by one minus the dropout probability. This is known as the **weight scaling inference rule**.
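The reasoning behind the weight scaling inference rule can be checked on a single linear unit. This sketch uses made-up weights and activations, and enumerates every dropout mask to compute the exact expected output during training, then compares it with the output from scaled weights and no dropout. For non-linear networks the match is only approximate.

```python
from itertools import product


def linear_unit(weights, activations):
    return sum(w * a for w, a in zip(weights, activations))


def expected_output_under_dropout(weights, activations, drop_prob):
    """Exact expectation over every possible dropout mask (1 = kept, 0 = dropped)."""
    expectation = 0.0
    for mask in product([0, 1], repeat=len(activations)):
        probability = 1.0
        for keep in mask:
            probability *= (1.0 - drop_prob) if keep else drop_prob
        masked = [a * keep for a, keep in zip(activations, mask)]
        expectation += probability * linear_unit(weights, masked)
    return expectation


weights = [0.5, -1.0, 2.0]
activations = [1.0, 3.0, 0.5]
drop_prob = 0.25

training_expectation = expected_output_under_dropout(weights, activations, drop_prob)
# Weight scaling inference rule: multiply by one minus the dropout probability.
inference_weights = [w * (1.0 - drop_prob) for w in weights]
inference_output = linear_unit(inference_weights, activations)
```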

**Monte Carlo** dropout involves running the network multiple times with different units clamped to zero and then combining the results. This is similar to ensembling but with a single model.

Dropout can be thought of as applying Bernoulli noise to the activations. Noise can also be applied to the inputs, the weights, or the labels (though you can simply change the loss function using something called **label smoothing** rather than changing the labels directly). In **adversarial** training, another model deliberately constructs examples that the model will struggle to deal with.

While you could use Bayesian approaches rather than maximum likelihood, this is difficult in practice and tends to involve various approximations.

**Transfer learning** involves **pre-training** the model on some other task with abundant data. The final layers can then be replaced, with the model being **fine-tuned** for the new task. The book doesn't go into further detail here, but there are many ways to fine-tune a model. See [Generative AI with large language models](/technical-courses/generative-ai-with-large-language-models.md) for some possible approaches.

In **multi-task** learning, the same model is trained for multiple tasks concurrently.

We can apply **generative self-supervised learning** (where parts of the data are masked and the model attempts to inpaint them) or **contrastive** self-supervised learning (where the model learns to distinguish related pairs of examples from unrelated pairs).

Data **augmentation** can be particularly useful for image-related tasks.

## Chapter 10: Convolutional networks

**Invariance**: $$f(t(x))=f(x)$$

**Equivariance** (or **covariance**): $$f(t(x))=t(f(x))$$

The **convolution** operation involves sliding a fixed set of weights (the convolution **kernel** or **filter**) over the input. The key features are the **stride** (step size), **padding** (what to do at the boundary), kernel **size**, and the **channels** (number of outputs).

**Zero padding** assumes that the input is zero outside of the valid range. **Valid** convolutions only include outputs where the kernel is contained in the input range, but these have the disadvantage of decreasing the number of units relative to the previous layer.

**Dilated** (or **atrous**) kernels intersperse the weights with zeros based on a **dilation rate**.
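The roles of stride, padding, and dilation can be seen in a one-dimensional sketch. The function below is an illustrative implementation for a single channel, not a library API. Like most deep learning libraries, it does not flip the kernel.

```python
def conv1d(signal, kernel, stride=1, padding=0, dilation=1):
    """Single-channel 1D convolution with zero padding."""
    padded = [0.0] * padding + list(signal) + [0.0] * padding
    # Number of input positions covered by the (possibly dilated) kernel.
    span = dilation * (len(kernel) - 1) + 1
    outputs = []
    for start in range(0, len(padded) - span + 1, stride):
        outputs.append(sum(w * padded[start + i * dilation]
                           for i, w in enumerate(kernel)))
    return outputs


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]

valid = conv1d(signal, kernel)                         # fewer outputs than inputs
same = conv1d(signal, kernel, padding=1)               # zero padding keeps the size
strided = conv1d(signal, kernel, padding=1, stride=2)  # downsampled
dilated = conv1d(signal, kernel, dilation=2)           # kernel spans 5 inputs
```

With this input, the valid convolution produces three outputs, the zero-padded version produces five, the strided version three, and the dilated version only one.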

The **receptive field** of a hidden unit is the region of the original input that feeds into it. This will grow with increasing depth in the network.

The size of subsequent layers can be decreased by increasing the stride or applying **max pooling**. This is known as **downsampling**. In general, the number of units is decreased and the number of channels increased as the depth increases.

We can **upsample** by duplicating values, applying linear interpolation, applying **max unpooling**, or using **transposed convolutions** (so called because the matrix is the transpose of the convolution matrix).

1x1 convolutions can be helpful for changing the number of channels.

The chapter concludes with a set of examples for image classification, object detection, and semantic segmentation.

## Chapter 11: Residual networks

Deep networks are difficult to train, potentially due to **shattered gradients**. One way to get around this is to use **residual** (or **skip**) connections, which pass the untransformed input around a layer and add it to the output. In addition to helping with gradient flow, this results in an implicit ensemble with multiple potential paths through the network.

Residual connections can result in exploding gradients as the variance will increase with each layer. While it would be possible to apply He initialization and then scale the output by dividing by $$\sqrt{2}$$, the more common solution is to apply **batch normalization**, which standardizes the activations based on the empirical mean and standard deviation across the batch. Although this adds parameters to the model, it allows higher learning rates (as it makes the loss surface more predictable), adds some implicit regularization (due to the randomness inherent in the batch), and effectively makes the network less deep at the start of training (as later layers have a smaller impact on overall variation than earlier ones).
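A minimal sketch of batch normalization for a single hidden unit is shown below. The parameter names `gamma` and `delta` (the learned scale and offset) and the value of `eps` are assumptions for illustration. At inference time, implementations typically use statistics accumulated during training instead of the current batch, which is omitted here.

```python
import math


def batch_norm(batch, gamma=1.0, delta=0.0, eps=1e-5):
    """Standardize one hidden unit's activations across a batch, then rescale and shift."""
    mean = sum(batch) / len(batch)
    variance = sum((a - mean) ** 2 for a in batch) / len(batch)
    return [gamma * (a - mean) / math.sqrt(variance + eps) + delta for a in batch]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
output_mean = sum(normalized) / len(normalized)
output_variance = sum((a - output_mean) ** 2 for a in normalized) / len(normalized)
```

With the default `gamma` and `delta`, the output has approximately zero mean and unit variance.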

At the end of the chapter, there are some examples of residual networks such as ResNet and hourglass networks.

## Chapter 12: Transformers

**Transformers** use **dot-product self-attention** to contextualize input tokens based on other tokens in the sequence.

**Queries**, **keys**, and **values** are computed from the input embeddings. Dot products between queries and keys, passed through a softmax, determine how values are combined to produce outputs.
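The computation can be sketched for a single attention head in plain Python. This version omits bias terms and divides the dot products by the square root of the key dimension (the commonly used scaled variant). The matrix names and the toy inputs are illustrative assumptions.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """x holds one embedding per row; w_q, w_k, w_v are projection matrices."""
    queries, keys, values = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        attention = softmax(scores)
        weights.append(attention)
        # Each output is a weighted combination of all the values.
        outputs.append([sum(a * v[d] for a, v in zip(attention, values))
                        for d in range(len(values[0]))])
    return outputs, weights


identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` sums to one, and the first token attends more strongly to tokens whose keys align with its query.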

**Multi-head** self-attention applies multiple attention mechanisms in parallel. A full transformer layer has multi-head attention followed by a fully-connected network on each word.

**Absolute** or **relative positional encodings** are added to treat different positions differently.

Transformers start with **sub-word tokenization** (e.g., byte pair encoding) to split text into tokens from a vocabulary.

**Masked** self-attention, where future tokens are masked, enables autoregressive decoding one token at a time. **Encoder-decoder attention** connects the decoder to encoder representations.

Quadratic complexity of full self-attention limits sequence lengths. Modifications like **pruning**, **sparsity**, and **global tokens** help scale to longer sequences.

**Top-k** and **nucleus sampling** (**top-p** sampling) improve text generation quality.
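Both strategies restrict sampling to the most probable tokens and then renormalize. The sketch below operates on a made-up next-token distribution. The function names are illustrative, and a real decoder would sample from the filtered distribution (for example with `random.choices`).

```python
def renormalize(probs, keep):
    """Zero out every token not in keep and rescale the rest to sum to one."""
    total = sum(probs[i] for i in keep)
    kept = set(keep)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def nucleus_filter(probs, p):
    """Keep the smallest set of most probable tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


next_token_probs = [0.5, 0.25, 0.125, 0.125]
top_two = top_k_filter(next_token_probs, k=2)
nucleus = nucleus_filter(next_token_probs, p=0.7)
```

Top-k always keeps a fixed number of candidates, whereas nucleus sampling keeps more candidates when the distribution is flat and fewer when it is peaked.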

Most attention heads can be pruned after training.

Self-attention can be viewed as a **hypernetwork**, in which one part of the network computes the weights used by another part.

## Chapter 13: Graph neural networks

**Graph convolutional networks** (GCNs) update nodes by aggregating neighbours' information, inducing a relational inductive bias. They are spatial-based, contrasting with spectral-based methods.

Simple GCN layers sum neighbour embeddings at each node. Variations include **diagonal enhancement**, averaging instead of summing neighbours (useful if the embedding information is more relevant than the structural information), and **Kipf normalization** to down-weight high-degree nodes.
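A simple layer of this kind can be sketched as follows. The adjacency-list representation, the function name, and the omission of bias terms and Kipf normalization are simplifying assumptions on my part.

```python
def gcn_layer(adjacency, embeddings, weight, aggregate="sum"):
    """adjacency[i] lists the neighbours of node i; embeddings[i] is node i's vector."""
    updated = []
    for node, neighbours in enumerate(adjacency):
        dim = len(embeddings[node])
        pooled = [sum(embeddings[n][d] for n in neighbours) for d in range(dim)]
        if aggregate == "mean" and neighbours:
            pooled = [value / len(neighbours) for value in pooled]
        # Combine the node's own embedding with the aggregated neighbourhood.
        combined = [own + agg for own, agg in zip(embeddings[node], pooled)]
        # Shared linear transform (one row per output dimension) followed by a ReLU.
        updated.append([max(0.0, sum(w * c for w, c in zip(row, combined)))
                        for row in weight])
    return updated


# A path graph 0 - 1 - 2 with one-dimensional embeddings.
adjacency = [[1], [0, 2], [1]]
embeddings = [[1.0], [2.0], [3.0]]
weight = [[1.0]]

summed = gcn_layer(adjacency, embeddings, weight)
averaged = gcn_layer(adjacency, embeddings, weight, aggregate="mean")
```

The middle node has two neighbours, so its summed update is larger than its averaged update. This is why averaging down-weights structural information such as node degree.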

**Transductive** models consider labelled and unlabelled data together (semi-supervised learning), in contrast to **inductive** models, which learn from labelled data and generalize to new data.

The receptive field size in GNNs is known as the **k-hop neighbourhood**. This is often too large (potentially the entire graph), a phenomenon known as the **graph expansion problem**. **Neighbourhood sampling** and **graph partitioning** address graph expansion by sampling neighbourhoods or clustering the graph.

**Graph attention** layers compute edge weights based on node data.

Deep GNNs can suffer from **suspended animation** (gradients not propagating) and **over-smoothing** (loss of local information, which can mean that increasing depth is less effective than for non-graph networks).

## Chapter 14: Unsupervised learning

There are a few extra things here, but this is a high-level comparison of a few ideas and techniques that I talked about in my notes on [Generative deep learning](/technical-books/generative-deep-learning.md).

## Chapter 15: Generative adversarial networks

While this book goes into more depth, my notes on [Generative deep learning](/technical-books/generative-deep-learning.md#chapter-4-generative-adversarial-networks) cover most of the high-level concepts that seem worth mentioning here.

In **progressive growing**, we initially train the GAN on low-resolution images before adding additional upsampling layers. The higher-resolution layers gradually "fade in" over time.

**Mini-batch** discrimination gives the discriminator access to statistics for the mini-batch. This encourages the generator to include sufficient variation in its output, preventing mode collapse.

**Auxiliary classifier GANs** and **InfoGAN** offer alternatives to the conditional GANs discussed in the other book.

## Chapter 16: Normalizing flows

As above, this is mostly just a more rigorous and in-depth version of [Generative deep learning](/technical-books/generative-deep-learning.md#chapter-6-normalizing-flow-models). The new idea that stood out the most to me was **multi-scale** flows, which gradually introduce variables rather than passing all of them through the full network.

## Chapter 17: Variational autoencoders

Again, this was mostly covered by [Generative deep learning](/technical-books/generative-deep-learning.md#chapter-3-variational-autoencoders).

## Chapter 18: Diffusion models

Another one where the topic was well covered by [Generative deep learning](/technical-books/generative-deep-learning.md#chapter-8-diffusion-models).

## Chapter 19: Reinforcement learning

A **policy** determines the **agent's action** for each **state**. **Stationary** policies depend only on the current state, while **non-stationary** policies also depend on the time step.

The **return** is the sum of discounted future **rewards** and measures the future benefit of being on a **trajectory** (an observed sequence of states, rewards, and actions). A **rollout** is a simulated trajectory. An **episode** is a trajectory from an initial to a terminal state.
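The return for a finite trajectory can be computed by working backwards through the rewards. The function name, rewards, and discount factor below are illustrative.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# 1.0 + 0.5 * 0.0 + 0.25 * 2.0
trajectory_return = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```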

The **temporal credit assignment problem** involves associating rewards with decisive actions that may occur many steps before the reward.

**Model-based** methods find the best policy from the transition matrix and reward structure using dynamic programming or by estimating them from observed MDP trajectories.

**Model-free** methods are divided into **value estimation** approaches (which estimate the optimal **state-action value function**) and **policy estimation** approaches (which directly estimate the optimal policy using gradient descent).

**Temporal difference** methods update the policy while the agent traverses the MDP, making each state's value consistent with the successor state's value using the **Bellman equation** (bootstrapping).

The **Policy Improvement Theorem** states that updating a policy to be greedy with respect to its value function creates a policy that is at least as good as the original.

**On-policy** methods learn from the current policy being followed by the agent, while **off-policy** methods learn from a policy different from the one being followed.

**Experience replay** stores past experiences (transitions) in a buffer and samples from it to break correlations in the observation sequence and smooth over changes in the data distribution.

**Q-learning** is an off-policy TD control algorithm that learns the optimal state-action value function Q(s, a) directly.
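A tabular version of the update can be sketched as below. The dictionary-based table, the state and action names, and the hyperparameter values are hypothetical.

```python
def q_learning_update(q, state, action, reward, next_state,
                      alpha=0.5, gamma=0.9, terminal=False):
    """Move Q(state, action) towards the bootstrapped target."""
    # The target uses the best next action, whatever the agent actually does next.
    best_next = 0.0 if terminal else max(q[next_state].values())
    target = reward + gamma * best_next
    q[state][action] += alpha * (target - q[state][action])


q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 1.0, "right": 2.0},
}
# The agent took "right" in s0, received a reward of 1.0, and landed in s1.
q_learning_update(q, "s0", "right", reward=1.0, next_state="s1")
```

Because the target takes the maximum over next actions rather than the action the behaviour policy would choose, the update is off-policy.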

**Offline** RL (also known as **batch** RL) learns from a fixed dataset of previously collected experiences without further interaction with the environment.

## Chapter 20: Why does deep learning work?

Saving the most important/interesting stuff for last, though a lot was touched upon in earlier chapters too.

The **lottery ticket hypothesis** (Frankle & Carbin, 2019) suggests that overparameterized networks contain small trainable sub-networks (winning tickets) that are sufficient to provide the performance. This is demonstrated by pruning and retraining from the same initial weights.

The loss function along a straight line between initial parameters and final values usually decreases monotonically, with occasional small bumps near the start (Goodfellow et al., 2015b).

Real optimization trajectories lie in low-dimensional subspaces, even though they don't proceed in a straight line (Li et al., 2018b).

Good minima are not generally linearly connected, as evidenced by the pronounced increase in loss along a straight line between two independently found minima (Goodfellow et al., 2015b).

Baldi & Hornik (1989) found that the error surface of a shallow network with linear activations has no local minima but only saddle points.

The **Goldilocks zone** is a range of L2 norm of weights where the curvature of the loss surface is unusually positive (Fort & Scherlis, 2019). He and Xavier initialization fall within this range.

**Grokking** (Power et al., 2022) is a phenomenon where a model trained on a small dataset suddenly transitions from overfitting to near-perfect generalization, often many epochs after it has fit the training data. This suggests that the model has "grokked" the underlying concept.

## Chapter 21: Deep learning and ethics

This is an important topic, but there wasn't too much non-obvious to note here.

One thing that I found useful was the distinction between the **inner alignment problem** (getting the model to behave in line with a well-specified loss function) and the **outer alignment problem** (the gap between the loss function and our actual objectives).

