Generative AI with large language models
https://www.coursera.org/learn/generative-ai-with-llms
Last updated
Like with Generative pre-trained transformers, these notes have been written retrospectively based on what I recorded in Anki, so they may not cover all topics in the course. While still a relatively short and high-level course, it felt less shallow than the GPT one, with much more detail on the models and how to train or fine-tune them. I created 149 Anki cards for this one compared to 26 for that, suggesting that it had about six times as much content that I felt was worth remembering.
Transformers were introduced in the 2017 paper "Attention is all you need"
The basic idea is that they learn to pay more attention to some tokens than others
An attention map can be used to visualize the attention paid to each token
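As a concrete (if simplified) illustration, here is a minimal single-head scaled dot-product attention in Python; the returned weight matrix is the attention map. This is my own sketch, not code from the course.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: returns the outputs and the attention map."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys: the attention map
    return weights @ V, weights

# Toy example: 3 tokens with 4-dimensional embeddings, used as queries, keys, and values
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out, attn_map = scaled_dot_product_attention(x, x, x)
print(attn_map)  # each row sums to 1: how much each token attends to the others
```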
As an alternative to retraining the model, you can provide examples in the context window
This is known as in-context learning
One-shot learning involves a single example
Few-shot learning uses several examples
Chain-of-thought prompting gives examples of reasoning through the question
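For illustration, a few-shot prompt might look something like this (the task and examples are made up):

```python
# Hypothetical few-shot sentiment-classification prompt: the examples in the
# context window show the model the task without any retraining.
few_shot_prompt = """\
Classify the sentiment of each review.

Review: The battery died after two days.
Sentiment: negative

Review: Setup was quick and the picture quality is great.
Sentiment: positive

Review: It arrived late but works fine.
Sentiment:"""
# The model is expected to continue the pattern with a sentiment label.
```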
Tokens are sampled based on the predicted probability
Top-k sampling restricts to the k tokens with highest probability
Top-p sampling restricts to the tokens whose combined probability is at most p
A temperature parameter can be used to control how likely the model is to predict a low-probability token
This is a scaling factor that is applied in the final softmax layer of the model
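A toy sketch of how temperature, top-k, and top-p fit together when sampling the next token (my own illustration; real decoders differ in details such as the order in which the filters are applied):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Toy next-token sampler: temperature scaling, then top-k and top-p filtering."""
    logits = np.asarray(logits, dtype=float) / temperature   # temperature rescales the softmax inputs
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k is not None:                                    # keep only the k most likely tokens
        kth_largest = np.sort(probs)[-top_k]
        probs = np.where(probs >= kth_largest, probs, 0.0)
        probs /= probs.sum()

    if top_p is not None:                                    # keep tokens whose cumulative probability is <= p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[cumulative <= top_p]
        if keep.size == 0:                                   # always keep at least the most likely token
            keep = order[:1]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs *= mask
        probs /= probs.sum()

    return np.random.choice(len(probs), p=probs)

# A low temperature makes low-probability tokens even less likely to be picked.
print(sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_k=3, top_p=0.9))
```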
The Chinchilla paper found that the compute-optimal balance between training data and model size is given by a training dataset roughly 20 times larger than the number of parameters
This is taken as fact in the course, but I've subsequently heard the paper questioned (though I've not dug into it myself to see whether I agree with the criticism)
Fine-tuning can lead to catastrophic forgetting, where performance deteriorates on other tasks
Consider fine-tuning on multiple tasks at once
Parameter-efficient fine-tuning may also help here as it limits changes to the original weights
Approaches to parameter-efficient fine-tuning
Additive methods freeze the base model and introduce new trainable components
Adapter methods add new layers
Soft-prompt methods learn sequences that are prepended to the input sequence
Unlike normal tokens, these can take any value in the embedding space
Different soft prompts can be swapped in at inference time, so the same base model can be used for multiple tasks
Reparameterization methods create low-rank transformations of the original weights
LoRA (low-rank adaptation) involves replacing the original weight matrix W with W + BA, where B and A are low-rank matrices
This is the most common approach
The original weights are frozen, with only the new matrices being learned
It's often sufficient to apply LoRA to the self-attention layers alone
The rank-decomposition matrices are small, so (like with soft prompts) different ones can be used for different tasks
LoRA can be combined with quantization to give QLoRA
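A minimal PyTorch sketch of the LoRA idea (my own illustration, not the course's code or any particular library's API): the pretrained weights are frozen and only the small matrices A and B are trained, so the effective weight becomes W + BA.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer with a trainable low-rank update W + (alpha/r) * B @ A."""
    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():                     # freeze the original weights
            p.requires_grad = False
        in_f, out_f = base_linear.in_features, base_linear.out_features
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)   # low-rank factors
        self.B = nn.Parameter(torch.zeros(out_f, r))         # B starts at zero, so W is unchanged initially
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Wrap, e.g., a self-attention projection; only A and B receive gradients.
layer = LoRALinear(nn.Linear(512, 512), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 512 = 8192 trainable parameters instead of 512 * 512
```

Libraries such as Hugging Face's peft package this pattern up, but the underlying idea is just the low-rank update above.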
You generally want models to conform to HHH: helpful, honest, and harmless
To control behaviour, you can apply reinforcement learning from human feedback (RLHF)
In practice, rewards would be determined by a reward model trained from human-labelled data
One approach is to use proximal policy optimization (PPO)
Kullback-Leibler divergence can be used to ensure that the new model doesn't drift too far from the original (see the sketch below)
Constitutional AI uses self-critique guided by a set of principles in order to develop a reward model
This is sometimes known as reinforcement learning from AI feedback (RLAIF)
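To make the KL idea concrete, here's a rough sketch of how the penalty is often folded into the RLHF reward (formulations vary between implementations; this is not the course's code):

```python
import torch

def penalized_rewards(reward_model_scores, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward-model score minus a KL-style penalty that discourages the fine-tuned
    policy from drifting too far from the original (reference) model."""
    # policy_logprobs / reference_logprobs: log-probs of the sampled tokens, shape (batch, seq_len)
    kl_per_token = policy_logprobs - reference_logprobs      # per-token estimate of the KL divergence
    return reward_model_scores - beta * kl_per_token.sum(dim=-1)

# Toy example with stand-in numbers
scores = torch.tensor([1.2, 0.4])   # sequence-level scores from the reward model
pol = -torch.rand(2, 5)             # stand-in per-token log-probs from the fine-tuned policy
ref = -torch.rand(2, 5)             # stand-in per-token log-probs from the frozen reference model
print(penalized_rewards(scores, pol, ref))
```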
Reducing the size of the model
Quantization (reducing the precision of the weights); see the sketch below
Quantization-aware training (if it's a new model)
Post-training quantization (if the model has already been trained)
Has a greater performance impact when applied to some components than others, so you would generally apply different levels of quantization to different parts of the model
Pruning weights
Less useful if few weights are close to zero
Distilling from a teacher model to a student model
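As an illustration of post-training quantization, here's a minimal symmetric int8 quantization of a single weight tensor (real schemes use per-channel scales, calibration data, and so on):

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric post-training quantization: map float weights to int8 plus a scale factor."""
    scale = np.abs(weights).max() / 127.0                    # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())                # quantization error
```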
Retrieval-augmented generation combines LLMs with search by adding search results to the context window
Useful for information not in the original training data (e.g., proprietary data or things that happened after the training cut-off)
Allows the model to provide sources
The search is often based on similarity search in an embedding space, but there's no reason why it can't be based on keyword search or a hybrid approach
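A minimal sketch of the retrieval step, assuming some embedding model has already produced vectors for the query and the passages (all names here are illustrative):

```python
import numpy as np

def retrieve(query_embedding, passage_embeddings, passages, k=3):
    """Return the k passages whose embeddings are most similar to the query (cosine similarity)."""
    q = query_embedding / np.linalg.norm(query_embedding)
    p = passage_embeddings / np.linalg.norm(passage_embeddings, axis=1, keepdims=True)
    scores = p @ q
    top = np.argsort(scores)[::-1][:k]
    return [passages[i] for i in top]

def build_rag_prompt(question, retrieved_passages):
    """Prepend the retrieved passages to the question in the context window."""
    context = "\n\n".join(retrieved_passages)
    return (f"Use the following context to answer the question.\n\n"
            f"{context}\n\nQuestion: {question}\nAnswer:")
```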
Program-aided language models combine the LLM with a code interpreter
Chain-of-thought prompting is used to get it to generate scripts that are then run
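A rough sketch of the program-aided pattern (the prompt and helper names are made up, and generated code would need sandboxing in practice):

```python
def solve_with_pal(llm, question):
    """Ask the model for Python that computes the answer, run it, and return the result."""
    prompt = (
        "Write Python that computes the answer and stores it in a variable called answer.\n\n"
        "Q: A bakery sells 12 loaves a day for 5 days. How many loaves in total?\n"
        "# loaves per day times number of days\n"
        "answer = 12 * 5\n\n"
        f"Q: {question}\n"
    )
    script = llm(prompt)          # hypothetical call that returns generated Python source
    namespace = {}
    exec(script, namespace)       # run the generated script (unsafe outside a sandbox)
    return namespace.get("answer")
```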