Generative deep learning
Foster (2023)
Part 1: Introduction to generative deep learning
Chapter 1: Generative modelling
Basic introduction to generative models and how they differ from discriminative models.
The chapter distinguishes between explicit density models, which model the probability density function directly, and implicit density models (e.g., GANs), which define a stochastic process that generates data without estimating the PDF.
Explicit models can be split into tractable-density models (which place constraints on the architecture so that the density can be calculated) and approximate-density models (which optimize an approximation).
Chapter 2: Deep learning
Brief overview of deep learning and CNNs. There's plenty of useful stuff in here, but it's covered in more detail in other books such as Understanding deep learning.
Part 2: Methods
Chapter 3: Variational autoencoders
An autoencoder consists of an encoder network that compresses the data into a latent space (or embedding space) and a decoder network that attempts to decompress this representation. For an image, you would use convolutional layers in the encoder and convolutional transpose layers in the decoder. Applications include denoising and dimensionality reduction.
To generate new images, you sample from the latent space and pass the chosen vector through the decoder.
MSE loss is the most common choice. Other loss functions are available (e.g., perceptual loss), though the book doesn't discuss many of them or spend much time on their advantages and disadvantages.
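As a concrete illustration, here is a minimal PyTorch sketch of a convolutional autoencoder trained with MSE reconstruction loss; the layer sizes and the 32×32 single-channel input are my own assumptions, not the book's exact model (the book uses Keras):

```python
import torch
import torch.nn as nn

# Minimal convolutional autoencoder sketch (illustrative sizes, not the book's model).
class Autoencoder(nn.Module):
    def __init__(self, latent_dim=2):
        super().__init__()
        # Encoder: convolutions compress a 1x32x32 image into a latent vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),   # -> 32x16x16
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # -> 64x8x8
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, latent_dim),
        )
        # Decoder: transposed convolutions map the latent vector back to an image.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = Autoencoder()
x = torch.rand(16, 1, 32, 32)               # a dummy batch of images
loss = nn.functional.mse_loss(model(x), x)  # reconstruction (MSE) loss
```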
The encoder in a variational autoencoder (VAE) maps to a normal distribution (or rather the parameters of a normal distribution) rather than an individual point, which encourages continuity in the latent space and provides a well-defined distribution to sample from. The loss function is extended to include a KL-divergence term for regularization rather than just consisting of the reconstruction loss.
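A sketch of the VAE-specific pieces, assuming an encoder that outputs the mean and log-variance of a diagonal Gaussian (the function names `reparameterize` and `vae_loss`, and the `beta` weighting, are illustrative):

```python
import torch

def reparameterize(mu, log_var):
    """Sample z ~ N(mu, sigma^2) in a differentiable way (the reparameterization trick)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Reconstruction loss plus KL divergence between N(mu, sigma^2) and N(0, I)."""
    recon = torch.nn.functional.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl
```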
It's possible to perform arithmetic in the latent space. For example, "Woman with glasses" - "Woman without glasses" + "Man without glasses" should give "Man with glasses" (though results may vary in practice). You could also generate a glasses concept vector by subtracting the average image without glasses from the average image with glasses.
Chapter 4: Generative adversarial networks
In a GAN, a generator network creates synthetic data and a discriminator network tries to distinguish between the synthetic data and real data. Training alternates between the two networks so that they both improve over time. The losses may not decrease even as the outputs improve, because each network is being measured against the latest version of the other.
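The alternating update might look roughly like the following sketch, where `generator`, `discriminator`, and the two optimizers are assumed to be defined elsewhere (not the book's code):

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, real, latent_dim=100):
    """One training step of a standard GAN (sketch)."""
    batch_size = real.size(0)
    ones = torch.ones(batch_size, 1, device=real.device)
    zeros = torch.zeros(batch_size, 1, device=real.device)

    # 1) Train the discriminator to tell real from fake.
    z = torch.randn(batch_size, latent_dim, device=real.device)
    fake = generator(z).detach()   # don't backprop into the generator here
    d_loss = (F.binary_cross_entropy(discriminator(real), ones)
              + F.binary_cross_entropy(discriminator(fake), zeros))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2) Train the generator to fool the (just-updated) discriminator.
    z = torch.randn(batch_size, latent_dim, device=real.device)
    g_loss = F.binary_cross_entropy(discriminator(generator(z)), ones)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```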
While GANs can produce higher-quality images than VAEs, they are tricky to train and are vulnerable to things like mode collapse, where the generator always produces very similar images.
A Wasserstein GAN (WGAN) uses the Wasserstein distance (also known as the earth mover's distance) instead of the Jensen-Shannon divergence, removes the sigmoid function from the output, and requires the critic to be a 1-Lipschitz function. (Lipschitz continuity was enforced by weight clipping in the original paper; WGAN-GP instead applies a gradient penalty.) As the discriminator's output is a score rather than a probability in a WGAN, it is usually referred to as a critic instead.
In a standard GAN, the generator's gradient can vanish if the discriminator is too good, so you need to be careful that the discriminator doesn't overpower the generator, but this is not an issue for WGANs, which also have the advantage that the loss function provides a meaningful measure of how close the generated distribution is to the real distribution.
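A hedged sketch of the WGAN-GP critic loss, with the gradient penalty computed at points interpolated between real and generated samples (a `gp_weight` of 10 is the commonly used value, but treat the details as illustrative):

```python
import torch

def critic_loss_wgan_gp(critic, real, fake, gp_weight=10.0):
    """Wasserstein critic loss with gradient penalty (WGAN-GP sketch)."""
    # Wasserstein part: maximize critic(real) - critic(fake), i.e. minimize the negative.
    w_loss = critic(fake).mean() - critic(real).mean()

    # Gradient penalty: penalize the critic's gradient norm deviating from 1
    # at points interpolated between real and fake samples (assumes NCHW images).
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    grads = torch.autograd.grad(
        outputs=critic(interp).sum(), inputs=interp, create_graph=True
    )[0]
    gp = ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

    return w_loss + gp_weight * gp
```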
In a conditional GAN, additional information (such as class labels) is provided to both the generator and the discriminator in order to guide the output.
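One common way to do this, sketched under the assumption of flat (vector) inputs, is to one-hot encode the label and concatenate it to both networks' inputs:

```python
import torch
import torch.nn.functional as F

def conditional_inputs(z, x_features, labels, num_classes=10):
    """Conditional GAN sketch: append the label to both networks' inputs."""
    y = F.one_hot(labels, num_classes).float()
    gen_input = torch.cat([z, y], dim=1)            # generator sees noise + label
    disc_input = torch.cat([x_features, y], dim=1)  # discriminator sees data + label
    return gen_input, disc_input
```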
Chapter 5: Autoregressive models
Autoregressive models base their predictions on previous values in a sequence.
Recurrent neural networks contain a recurrent layer (or cell) whose output at one timestep forms part of its input in the next. At timestep $t$, the cell receives both the input $x_t$ and the previous timestep's hidden state $h_{t-1}$. It then computes the hidden state $h_t$ and passes it to the next timestep.
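The recurrence can be written in a few lines; this is a sketch of a vanilla RNN cell with illustrative weight names, not any particular library's implementation:

```python
import torch

def rnn_cell(x_t, h_prev, W_x, W_h, b):
    """One step of a vanilla RNN: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return torch.tanh(x_t @ W_x + h_prev @ W_h + b)

def run_rnn(xs, h0, W_x, W_h, b):
    """Unroll over a sequence: the hidden state is carried from step to step."""
    h = h0
    states = []
    for x_t in xs:          # xs is a sequence of input vectors
        h = rnn_cell(x_t, h, W_x, W_h, b)
        states.append(h)
    return states
```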
Vanilla RNNs are vulnerable to vanishing gradients, resulting in them failing to learn long-term dependencies. Long short-term memory networks add a cell state which acts as long-term memory to complement the shorter-term memory of the hidden state. They use a forget gate to decide what to discard from the cell state, an input gate to determine what new information to store, and an output gate to control what information goes into the hidden state.
Gated recurrent units are a simplified variant of LSTMs with no separate cell state and only two gates (reset and update) rather than three.
Chapter 6: Normalizing flow models
Like VAEs, normalizing flows attempt to transform the data distribution into a simpler distribution (typically a standard Gaussian), but they impose the requirement that this mapping be invertible. This provides a bijection between the data space and the latent space, allowing efficient sampling and exact likelihood computation.
There are various types of flows, such as planar flows (affine transformations), radial flows, autoregressive flows, and coupling flows.
The book focuses on real-valued non-volume preserving (RealNVP) flows, which use a series of coupling layers: the input is split into two parts, one part is passed through unchanged, and the other is transformed by an affine (scale-and-shift) function whose parameters are computed from the first part. The masking is alternated between layers, ensuring that all dimensions are eventually transformed. (An alternative approach not discussed is non-linear independent components estimation (NICE), where the coupling is additive rather than affine.) Due to the nature of the transformations, the Jacobian is triangular and its determinant is easy to compute, allowing the likelihood to be calculated efficiently using the change of variables equation $p_X(x) = p_Z(f(x)) \left| \det \frac{\partial f(x)}{\partial x} \right|$.
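A sketch of a single affine coupling layer, assuming conditioner networks `s_net` and `t_net` and a simple split-in-half masking rather than the alternating checkerboard/channel masks used in practice:

```python
import torch

def affine_coupling_forward(x, s_net, t_net):
    """One RealNVP-style coupling layer (sketch).

    The first half of the dimensions passes through unchanged; the second half
    is scaled and shifted using functions of the first half. The log-determinant
    of the Jacobian is just the sum of the log-scales.
    """
    x1, x2 = x.chunk(2, dim=1)
    s, t = s_net(x1), t_net(x1)
    z1 = x1
    z2 = x2 * torch.exp(s) + t
    log_det = s.sum(dim=1)
    return torch.cat([z1, z2], dim=1), log_det

def affine_coupling_inverse(z, s_net, t_net):
    """Exact inverse of the layer above, which is what enables sampling."""
    z1, z2 = z.chunk(2, dim=1)
    s, t = s_net(z1), t_net(z1)
    x2 = (z2 - t) * torch.exp(-s)
    return torch.cat([z1, x2], dim=1)
```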
Chapter 7: Energy-based models
Energy-based models (EBMs) define a scalar energy function $E(x)$ over the data space, where lower energy corresponds to higher probability. Unlike explicit density models, EBMs model the unnormalized probability distribution, sidestepping the need to compute the partition function that normalizes the probabilities. The probability of a data point is given by the Boltzmann distribution $p(x) = e^{-E(x)} / Z$, but calculating the partition function $Z = \int e^{-E(x)}\,dx$ is often intractable due to the high dimensionality of $x$.
To train EBMs without computing $Z$, techniques like contrastive divergence are used. This approximates the gradient of the log-likelihood by comparing the energy of real data samples to that of negative samples generated through methods like Markov chain Monte Carlo (MCMC) or Langevin dynamics, which performs noisy gradient descent in the input space, effectively generating new data points that have low energy according to the model.
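A sketch of Langevin sampling of negative examples from an energy function, paired with a contrastive-divergence-style loss; the step size, noise scale, and number of steps are illustrative assumptions:

```python
import torch

def langevin_samples(energy_fn, x_init, steps=60, step_size=10.0, noise_scale=0.005):
    """Generate low-energy samples by noisy gradient descent on the input (sketch)."""
    x = x_init.clone().requires_grad_(True)
    for _ in range(steps):
        grad = torch.autograd.grad(energy_fn(x).sum(), x)[0]
        x = x - step_size * grad + noise_scale * torch.randn_like(x)
        x = x.detach().requires_grad_(True)   # don't accumulate the graph across steps
    return x.detach()

def cd_loss(energy_fn, real, fake):
    """Push down the energy of real data, push up the energy of the negatives."""
    return energy_fn(real).mean() - energy_fn(fake).mean()
```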
Early examples of EBMs include the Boltzmann machine and the restricted Boltzmann machine (RBM). RBMs are two-layer networks with visible and hidden units, with connections only between the layers, making them easier to train than fully connected Boltzmann machines. However, RBMs and similar models struggle with high-dimensional data due to slow mixing in MCMC sampling.
Recent advancements, like the Noise-Conditional Score Network (NCSN), attempt to estimate the score function (the gradient of the log probability density) directly, improving sample quality and training stability.
Chapter 8: Diffusion models
Diffusion models are generative models that learn data distributions by modelling the reverse of a diffusion process. In the forward diffusion process, data is progressively noised over several steps until it approximates a simple known distribution (e.g., Gaussian noise). The model is then trained to reverse this process, learning to denoise data step by step to generate new samples from noise.
The forward process is defined as a Markov chain where small amounts of noise are added at each step. The reverse process involves learning the conditional probabilities $p(x_{t-1} \mid x_t)$, which are approximated using neural networks trained to predict the noise added at each step.
Diffusion models are trained using a variant of denoising score matching, where the loss function measures the difference between the model's predicted noise and the true noise. By training across all timesteps, the model learns to generate high-quality samples from pure noise.
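A sketch of that objective, assuming a precomputed cumulative noise schedule `alpha_bar` and a `model(x_t, t)` that predicts the added noise:

```python
import torch

def diffusion_loss(model, x0, alpha_bar, num_steps=1000):
    """Noise-prediction loss for a diffusion model (sketch).

    A random timestep is chosen, the clean image x0 is noised to x_t using the
    closed-form forward process, and the model is trained to predict the noise.
    """
    batch = x0.size(0)
    t = torch.randint(0, num_steps, (batch,), device=x0.device)
    noise = torch.randn_like(x0)

    a_bar = alpha_bar[t].view(batch, 1, 1, 1)          # cumulative signal rate at step t
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * noise

    predicted = model(x_t, t)                          # model predicts the added noise
    return torch.nn.functional.mse_loss(predicted, noise)
```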
Diffusion models have demonstrated impressive results in image and audio generation, rivalling GANs in quality while offering more stable training and avoiding issues like mode collapse. They are also connected to energy-based models and score-based generative models, sharing foundational principles in probability density modelling.
Part 3: Applications
Chapter 9: Transformers
Transformers revolutionized sequence modelling by removing the need for recurrent architectures. They use self-attention mechanisms to attend to all elements of the sequence, capturing both short- and long-range dependencies effectively.
Each element in the sequence computes weighted sums over all other elements, determined by learned attention weights derived from queries, keys, and values. This allows the model to focus on relevant parts of the input when generating each part of the output. The book includes some equations and diagrams of the architecture, but the detail doesn't seem worth repeating here as those things can easily be found elsewhere.
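The core computation, sketched for a single head without the learned query/key/value projections; the optional causal mask is what enables the autoregressive use described below:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, causal=False):
    """Single-head attention sketch: softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # similarity of each query to each key
    if causal:
        # Mask out future positions so each token only attends to earlier ones
        # (this is what makes a decoder autoregressive).
        seq_len = scores.size(-1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=scores.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)              # attention weights
    return weights @ v
```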
In generative modelling, Transformers are used in an autoregressive manner, predicting the next element in a sequence based on the preceding elements.
In addition to natural language processing, Transformers have been applied to image generation, music generation, and other generative tasks. Their ability to handle long sequences and parallelize computation makes them very efficient compared to RNNs.
Chapter 10: Music generation
Autoregressive models like RNNs and Transformers can process sequences of musical events, generating music note by note or chord by chord.
Challenges in music generation include handling polyphony (multiple notes played simultaneously), capturing long-term structures like motifs and themes, and ensuring musical coherence over time.
Techniques such as hierarchical modelling allow the model to understand music at multiple levels, from individual notes to entire phrases.
Chapter 11: World models
World models are generative models used in reinforcement learning to simulate environments, allowing agents to learn without relying on real-world interactions. They typically consist of:
A perception model (encoder) that compresses high-dimensional observations into a lower-dimensional latent space.
A transition model (often an RNN) that predicts future latent states based on current states and actions.
A generation model (decoder) that reconstructs observations from latent states.
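A rough sketch of how the three components could be wired together; all module names here are assumptions rather than the book's code:

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Sketch: encode the observation, predict the next latent, decode."""
    def __init__(self, encoder, transition_rnn, decoder):
        super().__init__()
        self.encoder = encoder             # observation -> latent state
        self.transition = transition_rnn   # (latent + action, hidden) -> next hidden, e.g. nn.GRUCell
        self.decoder = decoder             # latent state -> reconstructed observation

    def step(self, obs, action, hidden):
        z = self.encoder(obs)
        rnn_input = torch.cat([z, action], dim=-1)
        next_hidden = self.transition(rnn_input, hidden)
        recon = self.decoder(z)
        return z, next_hidden, recon
```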
Chapter 12: Multimodal models
Multimodal models handle and generate data across multiple modalities, such as text, images, and audio. Approaches include:
Conditional generative models, where one modality is used as a condition for generating another (e.g., conditioning a GAN on text embeddings).
Representation learning, where models like CLIP learn shared embeddings for images and text by training on image-text pairs, enabling cross-modal retrieval and zero-shot learning.
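A sketch of a CLIP-style symmetric contrastive loss over a batch of matching image-text pairs (the `temperature` value is illustrative):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: the i-th image should match the i-th text."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature                 # pairwise similarities
    targets = torch.arange(image_emb.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```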
Comparison of the main techniques discussed
VAEs provide continuous, interpretable latent spaces but may produce blurrier images.
GANs generate sharp images but are challenging to train and can suffer from instability.
Diffusion models offer stability and high-quality outputs by modelling the data generation process as a denoising task.
Normalizing flows allow exact likelihood computation and efficient sampling through invertible transformations.
Energy-based models model unnormalized probabilities and capture rich dependencies but require sophisticated training techniques.