Generative AI with large language models

https://www.coursera.org/learn/generative-ai-with-llms

As with Generative pre-trained transformers, these notes were written retrospectively from what I recorded in Anki, so they may not cover every topic in the course. While still a relatively short and high-level course, it felt less shallow than the GPT one, with much more detail on the models and on how to train or fine-tune them. I created 149 Anki cards for this course versus 26 for that one, suggesting it had roughly six times as much content that I felt was worth remembering.

  • Transformers were introduced in the 2017 paper "Attention is all you need"

  • The basic idea is that they learn to pay more attention to some tokens than others

  • An attention map can be used to visualize the attention paid to each token

  • As an alternative to retraining the model, you can provide examples in the context window

    • This is known as in-context learning

    • One-shot learning involves a single example

    • Few-shot learning uses several examples

    • Chain-of-thought prompting gives examples of reasoning through the question
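
For example, a few-shot prompt for sentiment classification might look like this (my own illustration, not an example from the course):

```python
# Hypothetical few-shot prompt: two labelled examples followed by the query.
prompt = """Classify the sentiment of each review.

Review: The film was a delight from start to finish.
Sentiment: positive

Review: I walked out halfway through.
Sentiment: negative

Review: The plot dragged, but the acting was superb.
Sentiment:"""
```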

  • Tokens are sampled based on the predicted probability

    • Top-k sampling restricts to the k tokens with highest probability

    • Top-p sampling restricts to the highest-probability tokens whose cumulative probability is at most p

    • A temperature parameter can be used to control how likely the model is to predict a low-probability token

    • This is a scaling factor that is applied in the final softmax layer of the model
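
A minimal numpy sketch of these decoding controls (my own illustration, not code from the course):

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample a token id from logits using temperature, top-k and top-p."""
    logits = np.asarray(logits, dtype=np.float64) / temperature  # temperature scaling
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax

    order = np.argsort(probs)[::-1]              # token ids, most probable first
    if top_k is not None:
        order = order[:top_k]                    # keep only the k most probable tokens
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        keep = cumulative <= top_p               # cumulative probability at most p
        keep[0] = True                           # always keep the most probable token
        order = order[keep]

    kept = probs[order] / probs[order].sum()     # renormalize over the kept tokens
    return int(np.random.choice(order, p=kept))
```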

  • The Chinchilla paper found that the compute-optimal balance between training data and model size is given by a training dataset roughly 20 times larger than the number of parameters (e.g., Chinchilla itself pairs 70B parameters with roughly 1.4T training tokens)

    • This is taken as fact in the course, but I've subsequently heard the paper questioned (though I've not dug into it myself to see whether I agree with the criticism)

  • Fine-tuning can lead to catastrophic forgetting, where performance deteriorates on other tasks

    • Consider fine-tuning on multiple tasks at once

    • Parameter-efficient fine-tuning may also help here as it limits changes to the original weights

  • Approaches to parameter-efficient fine-tuning

    • Additive methods freeze the base model and introduce new trainable components

      • Adapter methods add new layers

      • Soft-prompt methods learn sequences that are prepended to the input sequence

        • Unlike normal tokens, these can take any value in the embedding space

        • The soft prompts can be switched to different ones at inference time so that the same base model can be used for multiple tasks
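
A minimal PyTorch sketch of soft-prompt tuning (my own illustration; it assumes a Hugging Face-style model that accepts an `inputs_embeds` argument):

```python
import torch
import torch.nn as nn

class SoftPromptModel(nn.Module):
    def __init__(self, base_model, embed_layer, prompt_length=20):
        super().__init__()
        self.base_model = base_model             # the frozen LLM
        self.embed = embed_layer                 # the model's token-embedding layer
        for p in self.base_model.parameters():
            p.requires_grad = False              # freeze all original weights
        embed_dim = self.embed.embedding_dim
        # The only trainable parameters: a short sequence of free vectors in the
        # embedding space (unlike real tokens, they can take any value).
        self.soft_prompt = nn.Parameter(torch.randn(prompt_length, embed_dim) * 0.02)

    def forward(self, input_ids):
        tok_embeds = self.embed(input_ids)                            # (batch, seq, dim)
        prompt = self.soft_prompt.unsqueeze(0).expand(tok_embeds.size(0), -1, -1)
        inputs = torch.cat([prompt, tok_embeds], dim=1)               # prepend the prompt
        return self.base_model(inputs_embeds=inputs)
```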

    • Reparameterization methods create low-rank transformations of the original weights

      • LoRA (low-rank adaptation) involves replacing the original weight matrix W with W + BA, where A and B are low-rank matrices

        • This is the most common approach

        • The original weights are frozen, with only the new matrices being learned

        • It's often sufficient to apply LoRA to the self-attention layers alone

        • The rank-decomposition matrices are small, so (like with soft prompts) different ones can be used for different tasks

        • LoRA can be combined with quantization to give QLoRA
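
A minimal PyTorch sketch of a LoRA layer (my own illustration, not code from the course):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update: W -> W + BA."""
    def __init__(self, linear, rank=8, alpha=16):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False                  # freeze the original weights
        out_features, in_features = linear.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)   # rank x in
        self.B = nn.Parameter(torch.zeros(out_features, rank))         # out x rank
        self.scale = alpha / rank                    # common scaling convention

    def forward(self, x):
        # Original output plus the low-rank correction B(Ax).
        return self.linear(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Initializing B to zero means the adapted model starts out identical to the base model; only A and B, a tiny fraction of the original parameter count, receive gradients.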

  • You generally want models to conform to HHH: helpful, honest, and harmless

  • To control behaviour, you can apply reinforcement learning from human feedback (RLHF)

    • In practice, rewards would be determined by a reward model trained from human-labelled data

    • One approach is to use proximal policy optimization (PPO)

    • Kullback-Leibler divergence can be used to ensure that the new model doesn't drift too far from the original

    • Constitutional AI uses self-critique guided by a set of principles in order to develop a reward model

      • This is sometimes known as reinforcement learning from AI feedback (RLAIF)
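
A minimal sketch of the KL-penalized reward idea (my own illustration; the reward-model score and token log-probabilities are assumed to come from elsewhere in the training loop):

```python
import numpy as np

def penalized_reward(reward_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model score minus a KL penalty against the frozen reference model,
    discouraging the updated policy from drifting too far from the original."""
    # Summing log pi(token) - log pi_ref(token) over the generated tokens gives
    # a sample-based estimate of the KL divergence between the two models.
    kl_estimate = np.sum(np.asarray(policy_logprobs) - np.asarray(ref_logprobs))
    return reward_score - beta * kl_estimate
```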

  • Reducing the size of the model

    • Quantization (reducing the precision of the weights)

      • Quantization-aware training (if it's a new model)

      • Post-training quantization (if the model has already been trained)

      • Has a greater performance impact when applied to some components than others, so you would generally apply different levels of quantization to different parts of the model
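
A minimal sketch of symmetric post-training quantization of a weight tensor to int8 (my own illustration):

```python
import numpy as np

def quantize_int8(weights):
    """Map float weights to int8 using a single symmetric scale factor."""
    scale = np.abs(weights).max() / 127.0              # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale                # approximate original weights

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())          # small quantization error
```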

    • Pruning weights

      • Less useful if few weights are close to zero

    • Distilling from a teacher model to a student model
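
A minimal sketch of a distillation loss (my own illustration): the student is trained to match the teacher's temperature-softened output distribution as well as the true labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-target loss (match the teacher) with the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients are comparable to the hard loss
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```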

  • Retrieval-augmented generation combines LLMs with search by adding search results to the context window

    • Useful for information not in the original training data (e.g., proprietary data or things that happened after the training cut-off)

    • Allows the model to provide sources

    • The search is often based on similarity search in an embedding space, but there's no reason why it can't be based on keyword search or a hybrid approach
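
A minimal sketch of the embedding-based retrieval step (my own illustration; `embed` is a placeholder for whatever embedding model you use):

```python
import numpy as np

def embed(text):
    """Placeholder: in practice, call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)                       # unit vector

def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query by cosine similarity."""
    q = embed(query)
    scores = [q @ embed(doc) for doc in documents]     # unit vectors, so dot = cosine
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

def build_prompt(query, documents):
    context = "\n\n".join(retrieve(query, documents))
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
```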

  • Program-aided language models combine the LLM with a code interpreter

    • Chain-of-thought prompting is used to get it to generate scripts that are then run
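
A minimal sketch of the program-aided pattern (my own illustration; `llm` stands in for a model call, and executing generated code like this would need sandboxing in practice):

```python
def solve_with_pal(question, llm):
    """Ask the model for Python that computes the answer, then execute it."""
    prompt = (
        "Write Python that computes the answer and stores it in `answer`.\n"
        f"Question: {question}\nCode:"
    )
    code = llm(prompt)            # hypothetical model call returning code text
    namespace = {}
    exec(code, namespace)         # run the generated script (sandbox in practice!)
    return namespace["answer"]
```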