# Generative AI with large language models

As with [Generative pre-trained transformers](/technical-courses/generative-pre-trained-transformers.md), these notes were written retrospectively from what I recorded in Anki, so they may not cover every topic in the course. Although it is still a relatively short and high-level course, it felt less shallow than the GPT one, with much more detail on the models and how to train or fine-tune them. I created 149 Anki cards for this one compared to 26 for that one, suggesting that it had about six times as much content that I felt was worth remembering.

* Transformers were introduced in the 2017 paper "Attention is all you need"
* The basic idea is that they learn to pay more attention to some tokens than others
* An attention map can be used to visualize the attention paid to each token
* As an alternative to retraining the model, you can provide examples in the context window
  * This is known as in-context learning
  * One-shot learning involves a single example
  * Few-shot learning uses several examples
  * Chain-of-thought prompting gives examples of reasoning through the question
* Tokens are sampled based on the predicted probability
  * Top-k sampling restricts to the k tokens with highest probability
  * Top-p (nucleus) sampling restricts to the most likely tokens, taken in descending order of probability until their combined probability reaches p
  * A temperature parameter can be used to control how likely the model is to predict a low-probability token
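  * The temperature is a scaling factor applied in the final softmax layer of the model: the logits are divided by it, so values above 1 flatten the distribution and values below 1 sharpen it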
* The Chinchilla paper found that the compute-optimal balance between training data and model size is a training dataset with roughly 20 times as many tokens as the model has parameters
  * This is taken as fact in the course, but I've subsequently heard the paper questioned (though I've not dug into it myself to see whether I agree with the criticism)
* Fine-tuning can lead to catastrophic forgetting, where performance deteriorates on other tasks
  * Consider fine-tuning on multiple tasks at once
  * Parameter-efficient fine-tuning may also help here as it limits changes to the original weights
* Approaches to parameter-efficient fine-tuning
  * Additive methods freeze the base model and introduce new trainable components
    * Adapter methods add new layers
    * Soft-prompt methods learn sequences that are prepended to the input sequence
      * Unlike normal tokens, these can take any value in the embedding space
      * The soft prompts can be switched to different ones at inference time so that the same base model can be used for multiple tasks
  * Reparameterization methods create low-rank transformations of the original weights
    * LoRA (low-rank adaptation) involves replacing the original weight matrix $$W$$ with $$W+BA$$, where $$A$$ and $$B$$ are low-rank matrices
      * This is the most common approach
      * The original weights are frozen, with only the new matrices being learned
      * It's often sufficient to apply LoRA to the self-attention layers alone
      * The rank-decomposition matrices are small, so (as with soft prompts) different ones can be swapped in for different tasks
      * LoRA can be combined with quantization to give QLoRA
* You generally want models to conform to HHH: helpful, honest, and harmless
* To control behaviour, you can apply reinforcement learning from human feedback (RLHF)
  * In practice, rewards would be determined by a reward model trained from human-labelled data
  * One approach is to use proximal policy optimization (PPO)
  * Kullback-Leibler divergence can be used to ensure that the new model doesn't drift too far from the original
  * Constitutional AI uses self-critique guided by a set of principles in order to develop a reward model
    * This is sometimes known as reinforcement learning from AI feedback (RLAIF)
* Reducing the size of the model
  * Quantization (reducing the precision of the weights)
    * Quantization-aware training (if it's a new model)
    * Post-training quantization (if the model has already been trained)
    * Has a greater performance impact when applied to some components than others, so you would generally apply different levels of quantization to different parts of the model
  * Pruning weights
    * Less useful if few weights are close to zero
  * Distilling from a teacher model to a student model
* Retrieval-augmented generation combines LLMs with search by adding search results to the context window
  * Useful for information not in the original training data (e.g., proprietary data or things that happened after the training cut-off)
  * Allows the model to provide sources
  * The search is often based on similarity search in an embedding space, but there's no reason why it can't be based on keyword search or a hybrid approach
* Program-aided language models combine the LLM with a code interpreter
  * Chain-of-thought prompting is used to get it to generate scripts that are then run
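
The sampling notes above (temperature, top-k and top-p) can be sketched in plain Python. This is only an illustration over a toy list of logits, not how any particular library implements it; the function names are my own. It assumes the convention that top-p keeps the token that pushes the cumulative probability past p, and some descriptions exclude that token instead.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities, dividing by the temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # assumption: the crossing token is kept
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def sample(filtered, rng=random):
    """Draw one token index from a {index: probability} mapping."""
    indices = list(filtered)
    weights = [filtered[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]
```

For example, `sample(top_p_filter(softmax_with_temperature(logits, 0.7), 0.9))` would apply a temperature of 0.7 and then sample from the top-p set. Lower temperatures concentrate probability on the most likely token, so fewer tokens survive the top-p cut.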
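
To make the LoRA note above more concrete, here is a minimal sketch of the $$W+BA$$ update using plain lists of rows. The helper names and the tiny initialisation scale are assumptions for illustration. Starting $$B$$ at zero follows the usual LoRA setup, so the effective weight equals the frozen $$W$$ before any training.

```python
import random


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def add(x, y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def init_lora(d_out, d_in, rank, rng):
    """Create B (d_out x rank) and A (rank x d_in).

    B starts at zero so that W + BA equals W before training;
    A gets small random values (the 0.01 scale is an arbitrary choice here).
    """
    b = [[0.0] * rank for _ in range(d_out)]
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    return b, a


def effective_weight(w, b, a):
    """The weight actually used at inference: frozen W plus the low-rank update."""
    return add(w, matmul(b, a))


def trainable_parameters(d_out, d_in, rank):
    """Return (full fine-tuning count, LoRA count) for one weight matrix."""
    return d_out * d_in, rank * (d_out + d_in)
```

For a 512 × 64 weight matrix with rank 8, `trainable_parameters(512, 64, 8)` gives 32,768 parameters for full fine-tuning against 4,608 for LoRA, which is why separate $$A$$ and $$B$$ pairs per task are cheap to store.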
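
The RLHF note about Kullback-Leibler divergence can be illustrated with a rough sketch. One common way to use it is to subtract a KL penalty from the reward, so that the tuned model is discouraged from drifting away from the original. The `beta` coefficient and the function names are hypothetical, and the sketch works on a single next-token distribution rather than full sequences.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens.

    Assumes q is non-zero wherever p is non-zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def penalised_reward(reward, tuned_probs, reference_probs, beta=0.1):
    """Reward-model score minus a penalty for drifting from the reference model."""
    return reward - beta * kl_divergence(tuned_probs, reference_probs)
```

When the tuned and reference distributions are identical the penalty is zero, and it grows as the tuned model puts more probability on tokens that the original model considered unlikely.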
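
As a small illustration of the quantization note above, the following sketch applies symmetric 8-bit post-training quantization to a flat list of weights. It assumes a single scale for the whole tensor, which is the simplest scheme; real toolkits typically use finer-grained scales, and the function names are my own.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]
```

Each weight is stored as one integer plus the shared scale, and the round trip changes each value by at most about half of `scale`. A single large outlier inflates the scale and therefore the error on every other weight, which is one reason some parts of a model tolerate quantization better than others.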

