Chapter 1: Transformers

The transformer is an important neural network architecture used for language modeling.

Recommended reading

  • Attention Is All You Need - Section 3 of the paper that introduced the transformer explains the architecture. Don't worry too much about the encoder and how that fits in, as that's somewhat specific to translation – unsupervised transformer language models are generally decoder-only.

Optional reading

  • The Illustrated Transformer - A blog post explaining the architecture more carefully. Read this if you're finding the original paper hard to follow.
  • GPT-3 - A 175-billion parameter decoder-only transformer language model that exhibits impressive meta-learning capabilities.
  • The Transformer Family - An overview of many transformer variants, including Transformer-XL, Image Transformer, Sparse Transformer, Reformer and Universal Transformer.
  • T5 - A careful study of different architectural details and training objectives for transformers.
  • Mixture-of-Experts - A form of parameter sparsity used by some more recent language models to improve training efficiency. Section 2 of this paper explains how they work.

Suggested exercise

Implement a decoder-only transformer language model.

  • Here are some first principle questions to answer:
    • Architecturally, what is different about the Transformer compared with a normal RNN, such as an LSTM? (Specifically, how are recurrence and time handled?)
    • Attention is defined as Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V. What are the dimensions of Q, K, and V? Why do we use this setup? What other combinations of (Q, K) could also produce a set of attention weights?
    • Are the dense layers different at each multi-head attention block? Why or why not?
    • Why do we have so many skip connections, especially connecting the input of an attention function to the output? Intuitively, what if we didn't?
  • Now we'll actually implement the code. Make sure each of these is completely correct; it's very easy to get the small details wrong. (Minimal reference sketches for several of these steps appear after this list.)
    • Implement the positional embedding function first.
    • Then implement the function which calculates attention, given (Q,K,V) as arguments.
    • Now implement the masking function.
    • Put it all together to form an entire attention block.
    • Finish the whole architecture.
  • If you get stuck, The Annotated Transformer may help, but don't just copy-paste the code.
  • To check that your attention mask is set up correctly, train your model on a toy task, such as reversing a random sequence of tokens. The model should be able to predict the second half of the sequence, but not the first (see the data-generation sketch after this list).
  • Finally, train your model on the complete works of William Shakespeare. Tokenize the corpus by splitting at word boundaries (re.split(r"\b", ...)). Make sure you don't use overlapping sequences, as this can lead to overfitting (see the tokenization sketch after this list).
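
For the positional embedding step, here is a minimal NumPy sketch of the sinusoidal positional encoding described in Attention Is All You Need. The function name and argument names are illustrative, and learned positional embeddings are an equally valid choice.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)). Assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe  # added to the token embeddings before the first block
```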
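For the attention and masking steps, a minimal NumPy sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V, together with a causal mask. The shapes are noted in comments; the convention that True means "may attend" is an arbitrary choice made here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(seq_len):
    # mask[i, j] is True iff position i may attend to position j (i.e. j <= i)
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def attention(Q, K, V, mask=None):
    # Q: (..., seq_q, d_k), K: (..., seq_k, d_k), V: (..., seq_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)  # (..., seq_q, seq_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)           # block disallowed positions
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    return weights @ V                                  # (..., seq_q, d_v)
```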
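One way to generate data for the sequence-reversal check; the function name and the vocabulary size are illustrative. With a correct causal mask, the loss on the second half of each sequence should drop to near zero while the first half stays at chance.

```python
import numpy as np

def reversal_batch(batch_size, half_len, vocab_size, rng):
    # Each row is a random sequence followed by its reverse; a correctly masked
    # decoder can predict the second half (having seen the first) but not the first.
    first_half = rng.integers(0, vocab_size, size=(batch_size, half_len))
    return np.concatenate([first_half, first_half[:, ::-1]], axis=1)

rng = np.random.default_rng(0)
batch = reversal_batch(batch_size=32, half_len=8, vocab_size=50, rng=rng)  # (32, 16)
```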
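A sketch of the tokenization and batching for the Shakespeare step, assuming the corpus is in a local text file (the filename is a placeholder). Sequences are chopped end to end, stepping by the sequence length rather than by one token, so that no two training examples overlap.

```python
import re

with open("shakespeare.txt") as f:  # placeholder path to the corpus
    text = f.read()

# Split at word boundaries; this keeps the words and the punctuation/whitespace between them.
tokens = [t for t in re.split(r"\b", text) if t]

vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[tok] for tok in tokens]

# Non-overlapping training sequences: step by seq_len, not by 1.
seq_len = 128
sequences = [ids[i:i + seq_len] for i in range(0, len(ids) - seq_len + 1, seq_len)]
```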