Generative pre-trained transformers
https://www.coursera.org/learn/chatgpt
This course was clear but quite short. It spent some time on n-gram models too, which was a useful way to build up to transformers, but meant that there was less time to cover transformers in detail. Overall, I felt Generative AI with large language models was a better introduction to the topic, going into much more detail despite still being somewhat concise.
I only made Anki cards at the time rather than separate notes, so the ones below only cover the topics that stood out to me rather than everything in the course.
Causal language models predict the next word, while masked language models predict a word that has been masked (hidden), which could be anywhere in the sentence, allowing the model to use both left and right context
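As a quick illustration (my own sketch using the Hugging Face `transformers` pipelines, not code from the course; the model names are just common examples):

```python
# Illustrative only: a masked LM can use context on both sides of the gap,
# while a causal LM only continues the text from left to right.
from transformers import pipeline

# Masked language model (e.g. BERT): predict the hidden [MASK] token.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The cat sat on the [MASK]."))

# Causal language model (e.g. GPT-2): predict the next tokens.
generate = pipeline("text-generation", model="gpt2")
print(generate("The cat sat on the", max_new_tokens=5))
```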
One option in language modelling is to use the Markov assumption, resulting in n-gram models
n-gram models are particularly vulnerable to data sparsity issues
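As a minimal sketch (my own toy example, not code from the course), a bigram model assumes the next word depends only on the current word; any bigram unseen in training then gets probability zero, which is the sparsity problem in miniature:

```python
# Toy bigram model (Markov assumption: next word depends only on the current word).
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()

# Count bigrams from the training text.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def bigram_prob(prev, nxt):
    """P(nxt | prev) estimated by maximum likelihood."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

print(bigram_prob("the", "cat"))    # seen in training -> high probability
print(bigram_prob("cat", "slept"))  # never seen -> 0.0 (data sparsity)
```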
Intrinsic evaluation: How well does the model predict hold-out data?
Often measured using perplexity (see the formula below)
Extrinsic evaluation: How well does the model do on specific downstream tasks?
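For reference, the standard definition of perplexity (not copied from the course) on a held-out sequence of N words w_1, ..., w_N is the inverse probability of the sequence, normalised for length:

```latex
\mathrm{PPL}(w_1, \dots, w_N)
  = P(w_1, \dots, w_N)^{-1/N}
  = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1}) \right)
```

Lower perplexity means the model assigns higher probability to the held-out data.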
Search approaches
Greedy search
Always pick the highest probability token
May not result in the highest probability sequence
Can result in bland/uninteresting text
Beam search
Maintain a fixed number of highest-probability sequences at each step (the beam width, usually set somewhere between 2 and 20); a toy comparison with greedy search is sketched below
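Here is a small self-contained sketch (my own toy example with made-up probabilities, not code from the course) showing greedy and beam search picking different sequences:

```python
# Toy greedy vs beam search over a hypothetical next-token distribution
# (the probabilities below are made up for illustration).
import math

# P(next_token | current_token) for a tiny vocabulary.
NEXT_PROBS = {
    "<s>": {"the": 0.55, "a": 0.45},
    "the": {"cat": 0.55, "dog": 0.45},
    "a":   {"cat": 0.95, "dog": 0.05},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def greedy_search(start="<s>", steps=3):
    """Always pick the single most likely next token."""
    seq, logp = [start], 0.0
    for _ in range(steps):
        token, prob = max(NEXT_PROBS[seq[-1]].items(), key=lambda kv: kv[1])
        seq.append(token)
        logp += math.log(prob)
        if token == "</s>":
            break
    return seq, math.exp(logp)

def beam_search(start="<s>", steps=3, beam_width=2):
    """Keep the beam_width most probable partial sequences at each step."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            if seq[-1] == "</s>":            # finished sequences carry over unchanged
                candidates.append((seq, logp))
                continue
            for token, prob in NEXT_PROBS[seq[-1]].items():
                candidates.append((seq + [token], logp + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    best_seq, best_logp = beams[0]
    return best_seq, math.exp(best_logp)

print(greedy_search())  # "<s> the cat </s>", probability ~0.30
print(beam_search())    # "<s> a cat </s>",   probability ~0.43
```

Greedy search commits to the locally best first token and ends up with a lower-probability sequence, while beam search keeps the alternative alive and recovers the higher-probability one.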
It then went on to talk about the structure of transformers and how they are trained. There was discussion about things like backpropagation, activation functions, padding, subword tokens, new word tokens, etc., but I haven't included notes here as resources like Understanding deep learning go into much more detail.