Generative AI with large language models

https://www.coursera.org/learn/generative-ai-with-llms

Like with Generative pre-trained transformers, these notes have been written retrospectively based on what I recorded in Anki, so may not cover all topics in the course. While still a relatively short and high-level course, it felt less shallow than the GPT one, with much more detail on the models and how to train or fine-tune them. I created 149 Anki cards for this one compared to 26 for that, suggesting that it had about six times as much content that I felt worth remembering.

Transformers were introduced in the 2017 paper "Attention is all you need"
The basic idea is that they learn to pay more attention to some tokens than others
An attention map can be used to visualize the attention paid to each token
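As a rough illustration of what an attention map holds, here is a minimal sketch of the scaled dot-product attention weights from the paper. The tokens and vectors are made up for the example (a real model learns separate query and key projections), and `attention_map` is a hypothetical helper name.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(queries, keys):
    """Row i holds the attention that token i pays to every token."""
    d = len(keys[0])
    rows = []
    for q in queries:
        # Dot product of the query with each key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        rows.append(softmax(scores))
    return rows

tokens = ["the", "cat", "sat"]
vectors = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # toy values, not learned
weights = attention_map(vectors, vectors)
for tok, row in zip(tokens, weights):
    print(tok, [round(w, 2) for w in row])
```

Each printed row sums to 1, and plotting the rows as a heat map gives the kind of visualization described above.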
As an alternative to retraining the model, you can provide examples in the context window
- This is known as in-context learning
- One-shot learning involves a single example
- Few-shot learning uses several examples
- Chain-of-thought prompting gives examples of reasoning through the question
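A few-shot prompt is just text, so it can be sketched as simple string formatting. The sentiment task, the `Review:`/`Sentiment:` labels, and the `build_few_shot_prompt` helper are all assumptions for illustration, not something prescribed by the course.

```python
def build_few_shot_prompt(examples, query):
    """Format (text, label) pairs, then the unanswered query."""
    parts = [f"Review: {text}\nSentiment: {label}" for text, label in examples]
    # Leave the final label blank so the model completes it
    parts.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(parts)

examples = [
    ("I loved this film.", "Positive"),
    ("The plot made no sense.", "Negative"),
]
prompt = build_few_shot_prompt(examples, "A charming, funny surprise.")
print(prompt)
```

Passing a single example gives a one-shot prompt; for chain-of-thought prompting, the labels would be replaced by worked reasoning followed by the answer.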
Tokens are sampled based on the predicted probability
- Top-k sampling restricts to the k tokens with highest probability
- Top-p (nucleus) sampling restricts to the smallest set of highest-probability tokens whose combined probability reaches p
- A temperature parameter can be used to control how likely the model is to predict a low-probability token
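  - This is a scaling factor applied in the final softmax layer of the model: the logits are divided by the temperature, so values above 1 flatten the distribution and values below 1 sharpen it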
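The sampling controls above can be sketched in a few lines. This is a minimal version under some assumptions: the logits are invented, the function names are hypothetical, and the top-p filter follows the common convention of keeping the token that takes the cumulative probability up to p.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the most probable tokens until their cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(filtered, rng=random):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
probs = softmax_with_temperature(logits, temperature=1.0)
next_token = sample(top_p_filter(probs, p=0.8), random.Random(0))
print(next_token)
```

Raising the temperature moves probability mass towards the less likely tokens before any filtering is applied, which is why it is usually described as a creativity control.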
The Chinchilla paper found that the compute-optimal balance between training data and model size is a training dataset of roughly 20 tokens per model parameter
- This is taken as fact in the course, but I've subsequently heard the paper questioned (though I've not dug into it myself to see whether I agree with the criticism)
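To make the ratio concrete, here is a small sketch that applies the rule of thumb to a few model sizes. It assumes the dataset size is measured in tokens and treats 20 as a rough heuristic rather than an exact law; the function name is hypothetical.

```python
TOKENS_PER_PARAMETER = 20  # rule of thumb from the notes, not an exact constant

def compute_optimal_tokens(n_parameters):
    """Approximate compute-optimal number of training tokens."""
    return TOKENS_PER_PARAMETER * n_parameters

for n in (1_000_000_000, 7_000_000_000, 70_000_000_000):
    print(f"{n:.0e} parameters -> ~{compute_optimal_tokens(n):.1e} tokens")
```

For a 70-billion-parameter model this gives about 1.4 trillion tokens, which matches the size of the dataset that Chinchilla itself was trained on.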
Fine-tuning can lead to catastrophic forgetting, where performance deteriorates on other tasks
- Consider fine-tuning on multiple tasks at once
- Parameter-efficient fine-tuning may also help here as it limits changes to the original weights
Approaches to parameter-efficient fine-tuning
- Additive methods freeze the base model and introduce new trainable components
  - Adapter methods add new layers
  - Soft-prompt methods learn sequences that are prepended to the input sequence
    Unlike normal tokens, these can take any value in the embedding space
    The soft prompts can be switched to different ones at inference time so that the same base model can be used for multiple tasks
- Reparameterization methods create low-rank transformations of the original weights
  - LoRA (low-rank adaptation) involves replacing the original weight matrix $W$ with $W+BA$, where $A$ and $B$ are low-rank matrices
    This is the most common approach
    The original weights are frozen, with only the new matrices being learned
    It's often sufficient to apply LoRA to the self-attention layers alone
    The rank-decomposition matrices are small, so (like with soft prompts) different ones can be used for different tasks
    LoRA can be combined with quantization to give QLoRA
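The LoRA update $W+BA$ can be sketched with plain lists to show where the parameter saving comes from. The dimensions and random values are made up. Starting $B$ at zero, so that the adapted weights initially equal $W$, follows the LoRA paper as I understand it, and the paper's scaling factor is left out for brevity.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)

d_out, d_in, rank = 6, 8, 2  # toy sizes for illustration
rng = random.Random(0)

W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable
B = [[0.0] * rank for _ in range(d_out)]                              # trainable, starts at zero

W_adapted = add(W, matmul(B, A))  # same shape as W

print("full:", d_out * d_in, "LoRA:", lora_param_count(d_out, d_in, rank))
print("512 x 64 layer, rank 8:", lora_param_count(512, 64, 8), "vs", 512 * 64)
```

Swapping tasks then means swapping the small `A` and `B` matrices while `W` stays untouched.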
You generally want models to conform to HHH: helpful, honest, and harmless
To control behaviour, you can apply reinforcement learning from human feedback (RLHF)
- In practice, rewards would be determined by a reward model trained from human-labelled data
- One approach is to use proximal policy optimization (PPO)
- Kullback-Leibler divergence can be used to ensure that the new model doesn't drift too far from the original
- Constitutional AI uses self-critique guided by a set of principles in order to develop a reward model
  - This is sometimes known as reinforcement learning from AI feedback (RLAIF)
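The role of the KL divergence can be sketched as a penalty subtracted from the reward model's score. This is a simplified illustration rather than a PPO implementation: the distributions are invented, `beta` is a hypothetical penalty coefficient, and real systems typically work with per-token log-probabilities.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_reward(reward, policy_probs, reference_probs, beta=0.1):
    """Reward model score minus a penalty for drifting from the reference model."""
    return reward - beta * kl_divergence(policy_probs, reference_probs)

reference = [0.5, 0.3, 0.2]  # original model's next-token distribution (toy)
tuned = [0.8, 0.15, 0.05]    # updated model's distribution (toy)

print(penalized_reward(1.0, reference, reference))  # no drift, no penalty
print(penalized_reward(1.0, tuned, reference))      # drift lowers the reward
```

A larger `beta` keeps the updated model closer to the original, at the cost of slower movement towards whatever the reward model prefers.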
Reducing the size of the model
- Quantization (reducing the precision of the weights)
  - Quantization-aware training (if it's a new model)
  - Post-training quantization (if the model has already been trained)
  - Has a greater performance impact when applied to some components than others, so you would generally apply different levels of quantization to different parts of the model
- Pruning weights
  - Less useful if few weights are close to zero
- Distilling from a teacher model to a student model
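As an example of the first option, here is a minimal sketch of symmetric post-training quantization to 8-bit integers. The weights are made up and a single scale is used for the whole list, whereas real schemes usually quantize per channel or per block.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the integers."""
    return [q * scale for q in quantized]

weights = [0.02, -1.27, 0.64, 0.0]  # toy values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)
print([round(r, 4) for r in restored])
```

The rounding error per weight is at most half the scale, so a single outlier with a large magnitude makes every other weight less precise. That is one reason different parts of a model tolerate quantization differently.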
Retrieval-augmented generation combines LLMs with search by adding search results to the context window
- Useful for information not in the original training data (e.g., proprietary data or things that happened after the training cut-off)
- Allows the model to provide sources
- The search is often based on similarity search in an embedding space, but there's no reason why it can't be based on keyword search or a hybrid approach
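The retrieval step can be sketched as a cosine-similarity search followed by prompt assembly. The documents, file names, and three-dimensional vectors are invented for illustration and stand in for the output of a real embedding model; the helper names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, top_n=2):
    """Return the top_n documents most similar to the query vector."""
    ranked = sorted(index, key=lambda d: cosine_similarity(query_vec, d["vector"]), reverse=True)
    return ranked[:top_n]

def build_prompt(question, documents):
    """Put the retrieved text, tagged with its source, into the context window."""
    context = "\n".join(f"[{d['source']}] {d['text']}" for d in documents)
    return f"Answer using only the context below and cite sources.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"

index = [  # toy index; vectors would come from an embedding model
    {"source": "hr-policy.md", "text": "Employees get 25 days of annual leave.", "vector": [0.9, 0.1, 0.0]},
    {"source": "it-guide.md", "text": "Reset passwords via the self-service portal.", "vector": [0.0, 0.2, 0.9]},
    {"source": "travel.md", "text": "Book travel through the approved agency.", "vector": [0.4, 0.8, 0.1]},
]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of the question
hits = retrieve(query_vec, index)
print(build_prompt("How much annual leave do I get?", hits))
```

Keeping the source name next to each snippet is what lets the model cite where an answer came from. A keyword or hybrid search would only change `retrieve`.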
Program-aided language models combine the LLM with a code interpreter
- Chain-of-thought prompting is used to get it to generate scripts that are then run
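The interpreter side of this can be sketched as follows. The script is hard-coded here to stand in for model output, the convention of storing the result in a variable called `answer` is an assumption, and emptying the builtins is not a real sandbox.

```python
def run_generated_script(script):
    """Execute model-written Python and return its `answer` variable."""
    namespace = {}
    # Builtins are emptied as a small precaution; untrusted code needs
    # proper isolation (e.g. a separate process or container).
    exec(script, {"__builtins__": {}}, namespace)
    return namespace["answer"]

# Stand-in for what the LLM might generate for a word problem
generated_script = """
loaves_baked = 200
sold_morning = 93
sold_afternoon = 39
returned = 6
answer = loaves_baked - sold_morning - sold_afternoon + returned
"""

print(run_generated_script(generated_script))
```

The model only has to get the reasoning steps right, while the arithmetic is delegated to the interpreter.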

Last updated 8 months ago