Model Name | Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training | Pre-Train Data | Batch Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BERT | Encoder-Only | Masked Language Modeling (MLM) + Next Sentence Prediction (NSP) | WordPiece | 30k tokens | Greedy decomposition of a word into sub-words until every piece is in the vocabulary (see sketch below the table). | Sum of: token embeddings (WordPiece) + segment embeddings (learned) + absolute position embeddings (see sketch below the table) | Scaled dot-product self-attention (note: it is advised to pad inputs on the right rather than the left since the positional embeddings are absolute) | GeLU, which avoids the dying-ReLU problem (a ReLU node can get stuck at 0 for negative inputs, stop learning, and never recover) | BERT_base: 110M; BERT_large: 340M | Adam with L2 weight decay, LR warm-up over the first 10k steps then linear decay, ~1M steps | BookCorpus and Wikipedia (~16 GB uncompressed) | 256 batch size, maximum length 512 |
DistilBERT | Encoder-Only (distilled version of BERT) | Triplet loss: (1) MLM + (2) distillation (soft teacher targets) + (3) cosine-distance between hidden states; no NSP (see sketch below the table) | Same as BERT | Same as BERT | Same as BERT | Embeddings are similar to BERT, except the segment embeddings are removed | Same as BERT | Same as BERT | 66M parameters | Same as BERT (I think) | Same as BERT | 4096 batch size |
ALBERT | Encoder-Only | MLM (with n-gram masking) + Sentence Order Prediction (SOP) instead of NSP | SentencePiece (as opposed to BERT's WordPiece), similar to XLNet | ~30k | Same as XLNet (greedy algorithm) | SentencePiece token embeddings, factorized: embed_size=128 is projected up to hidden_size (see parameter-count sketch below the table) | Encoder-only self-attention; the MLM masking probabilities differ from BERT's (n-gram masking) | GeLU (same as BERT) | ALBERT-base (sharing attention layers): 12 layers, hidden_size=768, embed_size=128 --> 64M parameters | LAMB optimizer (to support the large batch size), 125k training steps | Same as BERT | 4096 batch size |
RoBERTa | Encoder-Only | Masked Language Model objective with dynamic masking (see sketch below the table); no NSP or SOP (removing NSP was shown to be better) | Byte-level BPE (like GPT-2) | 50k | Byte-level BPE can encode any string, so there is effectively no OOV | Same as BERT | Same as BERT | Same as BERT | Model architecture is kept fixed to BERT_base / BERT_large (~125M / ~355M parameters with the larger vocabulary) | Pre-training is increased from 100k steps (roughly BERT-equivalent compute) up to 500k steps, with a tweak to the Adam hyper-parameters (beta_2 = 0.98 for large batches) | They combine 5 datasets (BookCorpus + Wikipedia, CC-News, OpenWebText, Stories) for ~160 GB of text | 8k batch size, max sequence length 512 (sometimes less due to the sampling technique) |
BART | Encoder-Decoder (Transformer) | Reconstruction loss: the usual cross-entropy between the decoder output and the original document (the encoder sees a corrupted document); they examine several corruption variants: token masking, token deletion, text infilling, sentence permutation, and document rotation | Same BPE encoding as GPT-2 | Same as GPT-2 / RoBERTa (~50k) | Same as GPT-2 / RoBERTa (byte-level BPE, effectively no OOV) | Same as GPT-2 / RoBERTa (token + learned position embeddings) | Same as the original Transformer | GeLU | BART_base: 6 encoder + 6 decoder layers (~140M); BART_large: 12 + 12 (~400M) | Pre-trained for 500k steps, otherwise following the RoBERTa recipe (Liu et al. 2019) | 160 GB of data, similar to Liu et al. 2019 | (for the large-scale experiments) batch_size = 8K |
T5 | Encoder-Decoder | BERT-style denoising (span-corruption) objective: similar to MLM, the model is trained to predict missing or corrupted tokens in the input; 15% of tokens are randomly sampled and dropped out, and each dropped span is replaced by a sentinel token (see sketch below the table). (Note: they experimented with many variants) | SentencePiece | 32k (across many languages with a 10:1 English-to-non-English ratio) | Same as BERT | Just token embeddings (relative position biases inside attention replace positional embeddings) | Self-attention + encoder-decoder attention (per layer) | ReLU | This study looks at many variants, but the Base model is similar to BERT_base: encoder and decoder are each roughly BERT_base-sized, ~220M parameters total (variants range from Small, 60M, up to 11B) | AdaFactor optimizer, inverse-square-root LR schedule, 2^19 (~524k) pre-training steps | Common Crawl's C4 dataset (~750 GB after cleaning, filtered from ~20 TB of raw Common Crawl) | T=512, batch = 128 sequences with packing such that each batch is approximately 65k tokens (much smaller than other studies) |
Adapter-BERT | Encoder-Only | Same as BERT (only fine-tuning happens in this paper) | Same as BERT | Same as BERT | Same as BERT | Same as BERT | Same as BERT | Same as BERT | Pre-trained BERT + adapter layers | Fine-tuning procedure: Adam with the LR increased during the first 10% of steps, then decayed to 0 during the last 90%; the LR is swept, as well as the adapter size | Same as BERT | Batch size = 32 |
ByT5 | Encoder-Decoder | Similar to span corruption (T5's pre-training objective) | Token-free! (operates directly on UTF-8 bytes; see sketch below the table) | 256 byte values + 3 special tokens | There is still an OOV token, but it is never needed | Only 256 token embeddings, no positional embeddings | Self-attention + encoder-decoder attention | GeGLU (follows mT5 / T5.1.1) | Model sizes were matched to mT5 (Small, Base, Large, XL, XXL); to compensate for the tiny vocabulary, the encoder depth ("heavy encoder") and dim_model, dim_ffn were increased | All hyper-parameters are the same as mT5, except: sequences of 1024 bytes and a mean mask-span length of 20 bytes (vs. 3 tokens in mT5) | Same mC4 corpus as the mT5 model | T = 1024 bytes, batch_size = 2^20 tokens (1024 sequences x 1024 bytes, as in mT5) |
CLIP | Encoder-Only (2 Transformer-based encoders: text + image) | Symmetric cross-entropy loss over cosine similarities (contrastive: maximize similarity for matched text-image pairs, minimize it for mismatched pairs; see sketch below the table) | For the text encoder: BPE | For the text encoder: ~50k (49,152) | Same as GPT | Text and image features are linearly projected into a shared multi-modal embedding space | Used differently in the two encoders: masked (causal) self-attention in the text encoder, standard self-attention in the ViT image encoder | Linear projections into the shared embedding space | In the base text encoder, 63M parameters | Adam optimizer with weight decay, cosine LR schedule, learnable temperature | WebImageText dataset: 400M (text, image) pairs | 32,768 |
DALL-E | Decoder-Only (see the Attention column) | No separate pre-training/fine-tuning stages per se; the transformer autoregressively models the concatenated text + image token stream | BPE for text + dVAE image ("pixel") tokens | text = 16,384; image = 8,192 | Greedy? | Token (and position) embeddings | 3 types of attention: text-to-text (causal), image-to-text (image tokens attend to all text tokens), and image-to-image attention | GeLU? | Up to 12B parameters! | Training is broken into 2 stages: (1) dVAE (Gumbel-softmax relaxation), (2) transformer (cross-entropy) with 16-bit precision | Conceptual Captions + a proprietary dataset | per_gpu = 8, total_batch = 512 |
Codex | Decoder-Only (GPT) | The usual causal (autoregressive) GPT language-modeling objective | BPE | ~50k + additional whitespace-run tokens | GPT-3 construction | GPT-3 construction | GPT-3 construction | GPT-3 construction | The large model was 12B parameters | Training was similar to GPT-3: 175-step linear warm-up, cosine learning-rate decay; training covered 100B tokens using the Adam optimizer with weight decay | 54 million public repositories on GitHub were scraped; after a number of filters, the final dataset was 159 GB | ? |
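
The BERT row's OOV handling refers to WordPiece's greedy, longest-match-first decomposition into sub-words. Below is a minimal Python sketch of that idea; the tiny vocabulary, the `##` continuation prefix, and the name `wordpiece_tokenize` are illustrative assumptions, not BERT's actual implementation.

```python
# Minimal sketch of WordPiece-style greedy (longest-match-first) sub-word
# decomposition, as used for BERT's OOV handling. The tiny vocabulary below
# is illustrative only; real BERT ships a ~30k-token vocab file.
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur_piece = None
        # Greedily try the longest substring first, shrinking until a match.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces are prefixed with '##'
            if piece in vocab:
                cur_piece = piece
                break
            end -= 1
        if cur_piece is None:
            return [unk_token]  # no decomposition found -> the whole word is OOV
        pieces.append(cur_piece)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", vocab))    # ['play', '##ing']
```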
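
BERT's input representation is the sum of three embeddings (token + segment + learned absolute position), as noted in the Embeddings column. The PyTorch sketch below shows that summation; the class name, default sizes, and the LayerNorm placement are illustrative assumptions, not a drop-in reimplementation of BERT.

```python
import torch
import torch.nn as nn

# Sketch of BERT-style input embeddings: the model input is the SUM of
# token (WordPiece), segment, and learned absolute position embeddings.
# Sizes are illustrative (vocab=30522, hidden=768, max_len=512, 2 segments).
class BertEmbeddings(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.seg = nn.Embedding(n_segments, hidden)
        self.pos = nn.Embedding(max_len, hidden)   # learned absolute positions
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)
        return self.norm(x)  # (real BERT also applies dropout here)

emb = BertEmbeddings()
ids = torch.randint(0, 30522, (2, 16))   # batch of 2 sequences, length 16
segs = torch.zeros_like(ids)             # all sentence-A tokens
print(emb(ids, segs).shape)              # torch.Size([2, 16, 768])
```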
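
ALBERT's embed_size=128 vs. hidden_size=768 comes from its factorized embedding parameterization: a V x E embedding table followed by an E x H projection instead of a single V x H table. A back-of-the-envelope check using the table's numbers:

```python
# Back-of-the-envelope sketch of ALBERT's factorized embedding parameterization:
# instead of one V x H embedding matrix, use V x E plus an E x H projection
# (E << H). Numbers match the table's ALBERT-base configuration.
V, H, E = 30_000, 768, 128

bert_style_embeddings = V * H              # one big V x H matrix
albert_style_embeddings = V * E + E * H    # factorized: V x E, then E x H

print(f"BERT-style:   {bert_style_embeddings:,} params")    # 23,040,000
print(f"ALBERT-style: {albert_style_embeddings:,} params")  # 3,938,304
print(f"Saved:        {bert_style_embeddings - albert_style_embeddings:,}")
```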
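
DistilBERT's "triplet loss" combines MLM cross-entropy, a distillation term on the teacher's temperature-softened logits, and a cosine loss aligning student and teacher hidden states. The sketch below assumes PyTorch and omits details such as loss weighting and restricting distillation to masked positions; the name `distilbert_loss` and the temperature `T=2.0` are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

# Sketch of DistilBERT's triplet training loss: (1) the usual MLM cross-entropy,
# (2) a distillation loss (KL divergence between temperature-softened student
# and teacher logits), and (3) a cosine loss pulling student hidden states
# toward the teacher's. Tensors below are random stand-ins for model outputs.
def distilbert_loss(student_logits, teacher_logits, mlm_labels,
                    student_hidden, teacher_hidden, T=2.0):
    vocab = student_logits.size(-1)
    # (1) masked-language-modelling loss (labels = -100 where not masked)
    loss_mlm = F.cross_entropy(student_logits.view(-1, vocab),
                               mlm_labels.view(-1), ignore_index=-100)
    # (2) distillation: match the teacher's softened output distribution
    loss_distil = F.kl_div(F.log_softmax(student_logits / T, dim=-1).view(-1, vocab),
                           F.softmax(teacher_logits / T, dim=-1).view(-1, vocab),
                           reduction="batchmean") * T * T
    # (3) cosine embedding loss on the hidden states (target = +1 everywhere)
    target = torch.ones(student_hidden.size(0) * student_hidden.size(1))
    loss_cos = F.cosine_embedding_loss(student_hidden.view(-1, student_hidden.size(-1)),
                                       teacher_hidden.view(-1, teacher_hidden.size(-1)),
                                       target)
    return loss_mlm + loss_distil + loss_cos   # loss weights omitted for simplicity

B, L, V, H = 2, 8, 30_522, 768
s_logits, t_logits = torch.randn(B, L, V), torch.randn(B, L, V)
labels = torch.randint(0, V, (B, L)); labels[:, ::2] = -100   # only some positions masked
print(distilbert_loss(s_logits, t_logits, labels,
                      torch.randn(B, L, H), torch.randn(B, L, H)))
```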
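
RoBERTa's dynamic masking samples a fresh 15% mask every time a sequence is fed to the model, instead of fixing the mask once during preprocessing as original BERT did. A minimal sketch, with a placeholder `MASK_ID` and vocabulary size and the standard 80/10/10 replacement split:

```python
import random

# Sketch of RoBERTa-style *dynamic* masking: a new 15% mask is sampled on every
# pass over a sequence rather than being fixed at preprocessing time.
# MASK_ID and VOCAB_SIZE are illustrative placeholders.
MASK_ID, VOCAB_SIZE = 4, 50_000

def dynamic_mask(token_ids, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                       # 80%: [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels

# Each call (i.e. each epoch) sees a different masking of the same sentence.
sentence = [101, 7592, 2088, 2003, 2307, 102]
print(dynamic_mask(sentence))
print(dynamic_mask(sentence))
```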
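
T5's denoising objective drops ~15% of input tokens, replaces each contiguous dropped span with a sentinel token, and asks the decoder to emit the sentinels followed by the dropped tokens. The sketch below works on string tokens for readability and samples drops independently per token, which is a simplification of T5's actual span-sampling procedure.

```python
import random

# Sketch of T5's span-corruption denoising objective: dropped spans in the input
# are replaced by sentinels (<extra_id_0>, <extra_id_1>, ...), and the target
# reconstructs the dropped spans behind the same sentinels.
def span_corrupt(tokens, drop_prob=0.15):
    drop = [random.random() < drop_prob for _ in tokens]
    inp, tgt, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        if drop[i]:
            inp.append(f"<extra_id_{sentinel}>")
            tgt.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and drop[i]:  # consume the whole dropped span
                tgt.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inp.append(tokens[i])
            i += 1
    tgt.append("</s>")
    return inp, tgt

tokens = "Thank you for inviting me to your party last week".split()
print(span_corrupt(tokens))
# e.g. (['Thank', 'you', '<extra_id_0>', 'me', 'to', 'your', 'party', '<extra_id_1>', 'week'],
#       ['<extra_id_0>', 'for', 'inviting', '<extra_id_1>', 'last', '</s>'])
```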
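
ByT5's "tokenless" vocabulary is just the 256 possible UTF-8 byte values plus 3 special tokens. Below is a minimal sketch of such a byte-level encoder/decoder; the particular special-token ids and the +3 offset follow the usual T5-style convention but are assumptions here, not pulled from the ByT5 code.

```python
# Sketch of ByT5's token-free input pipeline: text is mapped straight to its
# UTF-8 bytes, so the "vocabulary" is the 256 possible byte values plus
# 3 special ids (assumed here: 0=<pad>, 1=</s>, 2=<unk>).
SPECIAL_OFFSET = 3  # reserve ids 0..2 for the special tokens

def byt5_encode(text):
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [1]  # append </s>

def byt5_decode(ids):
    data = bytes(i - SPECIAL_OFFSET for i in ids if i >= SPECIAL_OFFSET)
    return data.decode("utf-8", errors="ignore")

ids = byt5_encode("héllo")   # non-ASCII characters simply take several bytes
print(ids)                   # [107, 198, 172, 111, 111, 114, 1]
print(byt5_decode(ids))      # 'héllo'
```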
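
CLIP's objective is a symmetric cross-entropy over the cosine-similarity matrix of a batch of image and text embeddings, scaled by a learnable temperature. A minimal PyTorch sketch under those assumptions (random vectors stand in for the encoders' outputs):

```python
import torch
import torch.nn.functional as F

# Sketch of CLIP's symmetric contrastive loss: cosine similarities between every
# (image, text) pair in a batch are scaled by a learnable temperature, then
# cross-entropy is taken along both axes so the N matched pairs on the diagonal
# are pulled together and the N^2 - N mismatched pairs are pushed apart.
def clip_loss(image_emb, text_emb, logit_scale):
    image_emb = F.normalize(image_emb, dim=-1)       # unit norm -> dot product = cosine
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * image_emb @ text_emb.t()  # (N, N) similarity matrix
    targets = torch.arange(image_emb.size(0))        # the i-th image matches the i-th text
    loss_img = F.cross_entropy(logits, targets)      # image -> text direction
    loss_txt = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_img + loss_txt) / 2

# Toy batch: random vectors stand in for the two encoders' outputs.
images, texts = torch.randn(8, 512), torch.randn(8, 512)
scale = torch.tensor(1 / 0.07)   # the paper initializes the learnable temperature to 0.07
print(clip_loss(images, texts, scale))
```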