| Model Name | Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training | Pre-Train Data | Batch Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT | Encoder-Only | • Masked Language Modeling (MLM): ~15% of tokens are chosen -> 80% replaced with [MASK], 10% replaced with a random token, 10% left unchanged; a shallow decoder reproduces the original text (a masking sketch follows the table).<br>• Next Sentence Prediction (NSP): binary classification task predicting whether 2 sequences follow each other in the corpus (useful for Q&A, etc.); positive and negative pairs are sampled 50/50.<br>• Training loss is the sum of the mean MLM likelihood and the mean NSP likelihood. | • WordPiece tokenization: original paper, huggingface explanation.<br>• Token breakdown: [CLS] token (useful for many-to-one fine-tuned tasks such as classification) + WordPiece tokens + [SEP] token after each sentence.<br>• MAX 512 tokens. | 30k tokens | Greedy decomposition of a token into sub-words until the pieces are found in the vocabulary (a longest-match sketch follows the table). | Sum of: token embeddings (WordPiece) + segment embeddings (learned) + absolute position embeddings | Scaled dot-product self-attention (note: it is advised to pad inputs on the right rather than the left since positional embeddings are absolute). | GeLU (avoids the dying-ReLU problem: a node can get stuck at 0 for negative inputs, stops learning, and cannot recover). | • BERT base: 12 layers (transformer blocks), 12 attention heads, hidden size = 768 -> ~110 MM params.<br>• BERT large: 24 layers, 16 attention heads, hidden size = 1024 -> ~340 MM params.<br>• Generally embed_size (E) == hidden_size (H), the feed-forward size is 4H, and the number of attention heads is H/64. For the math on how the total number of parameters is calculated, see the comment on the BERT github linked here and the parameter-count sketch below the table. | • Adam with L2 weight decay.<br>• Learning rate is warmed up over the first 10k steps to a peak value of 1e-4, then linearly decayed.<br>• Models are pre-trained for S = 1 MM updates.<br>• No layers frozen.<br>• Same learning rate throughout. | Book Corpus and Wikipedia (~16 GB uncompressed) | 256 batch size, maximum length 512 |
| DistilBERT | Encoder-Only (distilled version of BERT) | Triplet loss: (1) MLM + (2) distillation + (3) cosine distance (no NSP); a sketch of the three-part loss follows the table. | Same as BERT | Same as BERT | Same as BERT | Similar to BERT, except the segment embeddings are removed | Same as BERT | Same as BERT | 66 MM parameters | Same as BERT (I think) | Same as BERT | 4096 batch size |
| ALBERT | Encoder-Only | • Masked Language Model (MLM) loss (refer to BERT).<br>• Sentence Order Prediction (SOP) loss. | SentencePiece (as opposed to BERT's WordPiece), similar to XLNet | ~30k | Same as XLNet (greedy algorithm) | SentencePiece token embeddings | Encoder-only self-attention, but with a different masking probability | GeLU (same as BERT) | ALBERT-base (sharing attention layers): 12 layers, hidden_size=768, embed_size=128 -> 64 MM parameters | • Fine-tuning is task specific (see table 14).<br>• LAMB optimizer with LR=0.00176 for 125k steps. | Same as BERT | 4096 batch size |
| RoBERTa | Encoder-Only | Masked Language Model objective with dynamic masking (masks are re-sampled each time a sequence is seen rather than fixed during preprocessing); no NSP or SOP (removing NSP was shown to be better). | Byte-level BPE (like GPT) | 50k | Same as GPT? | Same as BERT | Same as BERT | Same as BERT | Model parameters are kept fixed: L=12, H=768, A=12 -> 110 MM parameters (+ ~15 MM for the larger byte-level BPE vocabulary). | They increase the number of pre-training steps from 100k (roughly matching BERT's compute at the larger batch size) up to 500k, and tweak the Adam hyper-parameters. | They combine 5 datasets for ~160 GB of text:<br>• Book Corpus + Wikipedia<br>• CC-News<br>• OpenWebText<br>• Stories | ~2k batch size, max sequence length ~512 (sometimes less due to the sampling technique) |
| BART | Encoder-Decoder (Transformer) | Reconstruction loss: the usual cross-entropy between the decoder output and the original document (the encoder sees a corrupted document), although they look at several variants:<br>• GPT: Language Model<br>• XLNet: Permuted Language Model<br>• BERT: MLM<br>• Multitask MLM<br>• Masked Seq-to-Seq (two-stream attention is used to compute likelihoods) | Same BPE encoding as GPT-2 | Same as GPT? Or RoBERTa? | Same as GPT? Or RoBERTa? | Same as GPT? Or RoBERTa? | Same as the original Transformer | GeLU | • BART contains roughly 10% more parameters than an equivalently sized BERT model: 6 encoder layers, 6 decoder layers, embed_size == hidden_size = 768.<br>• For large-scale experiments: 12 encoder layers, 12 decoder layers, hidden_size = 1024. | • 500k steps, using 30% token masking and sentence permutation.<br>• There is a different (2-step) training process for NMT. | 160 GB of data similar to Liu et al. 2019 (for large-scale experiments) | batch_size = 8k |
| T5 | Encoder-Decoder | BERT-style denoising objective: similar to MLM, the model is trained to predict missing or corrupted tokens in the input; 15% of tokens are randomly sampled and dropped out (note: they experimented with many variants; a span-corruption sketch follows the table). | SentencePiece | 32k (across many languages with a 10:1 English-to-non-English ratio) | Same as BERT | Token embeddings only (positions are handled by relative position biases inside attention) | Self-attention + encoder-decoder attention (per layer) | ReLU | The study looks at many variants, but the base is similar to BERT_base:<br>• 12 blocks (encoder + decoder)<br>• hidden_size == embed_size = 768<br>• FFN_dim = 3072 (4 * hidden)<br>Ultimately about twice the size of BERT -> 220 MM params. | • Pre-training: 2^19 steps.<br>• AdaFactor optimizer with an inverse-square-root LR schedule.<br>• Greedy decoding at test time.<br>• Fine-tuning: 2^18 steps, always with the same batch dimensions, LR = 0.001, checkpoints every 5k steps; results are reported for the checkpoint with the highest validation performance. | Common Crawl-derived C4 dataset (~750 GB after cleaning; raw Common Crawl is roughly 20 TB per month) | T = 512, batch = 128 sequences with packing such that each batch is approximately 65k tokens (much smaller than other studies) |
| Adapter-BERT | Encoder-Only | Same as BERT (only fine-tuning happens in this paper) | Same as BERT | Same as BERT | Same as BERT | Same as BERT | Same as BERT | Same as BERT | Pre-trained BERT + adapter layers (an adapter-module sketch follows the table) | Fine-tuning procedure: Adam with the LR increased during the first 10% of steps, then decayed to 0 during the last 90%; the LR is swept, as well as the adapter size. | Same as BERT | Batch size = 32 |
| ByT5 | Encoder-Decoder | Similar to span corruption (T5's pre-training objective) | Token-free (operates directly on UTF-8 bytes) | 256 byte values + 3 special tokens | There is still an OOV token, but it is not used | Only 256 token embeddings, no positional embeddings | Self-attention + encoder-decoder attention | Gated GeLU (GEGLU) feed-forward, following mT5 / T5.1.1 | Model sizes were made to match mT5 (small, base, large, XL, XXL); to compensate for the tiny vocabulary, the encoder depth ("heavy encoder"), dim_model, and dim_ffn were increased. | All hyper-parameters are the same as mT5, except:<br>• Sequence length: 1024 tokens/bytes<br>• 1 MM steps<br>• batch_size = 2^20 tokens | Same mC4 as the mT5 model | T = 1024 tokens/bytes, batch_size = 2^20 tokens |
| CLIP | Encoder-Only (2 Transformer-based encoders: text + image) | Symmetric cross-entropy loss over cosine similarities (contrastive learning: maximize similarity for matched text-image pairs, minimize it for mismatched pairs); a contrastive-loss sketch follows the table. | For the text encoder: BPE | For the text encoder: ~50k | Same as GPT | Multi-modal embeddings combining text and image features | Used in both the text and image encoders (differently) | Linear projections into the shared embedding space | In the base text encoder, 63 MM | Adam optimizer with weight decay, cosine scheduler, learnable temperature | WebImageText dataset, 400 MM (text, image) pairs | 32k |
| DALL-E | Decoder-Only (read about attention) | No pre-training/fine-tuning per se | BPE for text + image (dVAE codebook) tokens | text = 16,384; image = 8,192 | Greedy? | Token embeddings | 3 types of attention: text-to-text (causal), text-to-image, image-to-image | GeLU? | Up to 12 BN parameters! | Training is broken into 2 steps: 1. dVAE (Gumbel-Softmax relaxation), 2. transformer (cross-entropy), with 16-bit precision | Conceptual Captions + a proprietary dataset | per-GPU = 8, total_batch = 512 |
| Codex | Decoder-Only (GPT) | The usual "causal" GPT language-modeling objective | BPE | ~50k + additional whitespace-run tokens | GPT-3 construction | GPT-3 construction | GPT-3 construction | GPT-3 construction | The large model was 12 BN parameters | Training was similar to GPT-3: 175-step linear warm-up, cosine learning-rate decay; training lasted 100 BN tokens using the Adam optimizer with weight decay. | 54 million public repositories on GitHub were scraped; after a number of filters, the final dataset was 159 GB. | ? |
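
The sketches below are illustrative only. First, a minimal Python sketch of the BERT-style MLM masking described in the table (the 80/10/10 rule); the [MASK] id, vocab size, and token ids are placeholder values, not BERT's real ones, and RoBERTa's dynamic masking is noted in a comment.

```python
import random

MASK_ID = 103          # placeholder [MASK] token id, not BERT's real one
VOCAB_SIZE = 30_000    # placeholder vocab size


def mask_for_mlm(token_ids, mask_prob=0.15, seed=None):
    """Return (input_ids, labels); labels are -100 where no prediction is needed."""
    rng = random.Random(seed)
    input_ids, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                                  # ~85% of positions untouched
        labels[i] = tok                               # model must recover the original
        roll = rng.random()
        if roll < 0.8:
            input_ids[i] = MASK_ID                    # 80%: replace with [MASK]
        elif roll < 0.9:
            input_ids[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: leave the original token in place
    return input_ids, labels


# RoBERTa's dynamic masking simply re-runs this sampling every time a sequence
# is seen, instead of fixing the masks once during preprocessing.
inputs, labels = mask_for_mlm([2023, 2003, 1037, 7099], seed=0)
```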
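
A minimal sketch of the greedy longest-match-first sub-word decomposition used for OOV handling in WordPiece-style vocabularies; the tiny `vocab` set here is a toy stand-in, not BERT's real vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedily peel off the longest vocabulary piece; fall back to [UNK]."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:                   # try the longest remaining piece first
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece         # continuation marker for non-initial pieces
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:                    # no sub-word fits -> out of vocabulary
            return [unk]
        pieces.append(match)
        start = end
    return pieces


vocab = {"play", "##ing", "##ed", "un", "##play"}
print(wordpiece_tokenize("playing", vocab))   # ['play', '##ing']
print(wordpiece_tokenize("xqz", vocab))       # ['[UNK]']
```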
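
A rough back-of-the-envelope parameter count following the rules of thumb in BERT's Parameters cell (E == H, feed-forward = 4H, heads = H/64); biases, LayerNorm, and the pooler are ignored, so the totals land slightly below the published ~110 MM / ~340 MM figures.

```python
def approx_bert_params(layers=12, hidden=768, vocab=30_522,
                       max_positions=512, segments=2):
    embeddings = (vocab + max_positions + segments) * hidden
    per_layer = (
        4 * hidden * hidden            # Q, K, V and attention-output projections
        + 2 * hidden * (4 * hidden)    # feed-forward: H -> 4H and 4H -> H
    )
    return embeddings + layers * per_layer


print(f"BERT-base  ~{approx_bert_params() / 1e6:.0f} MM")           # ~109 MM
print(f"BERT-large ~{approx_bert_params(24, 1024) / 1e6:.0f} MM")   # ~334 MM
```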
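
A minimal PyTorch sketch of DistilBERT's three-part loss (MLM cross-entropy + distillation over softened teacher logits + cosine distance between hidden states); the equal weighting and the temperature value are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F


def distil_loss(student_logits, teacher_logits, labels,
                student_hidden, teacher_hidden, temperature=2.0):
    vocab = student_logits.size(-1)
    # (1) usual MLM cross-entropy on the hard labels (-100 = not a masked position)
    mlm = F.cross_entropy(student_logits.view(-1, vocab),
                          labels.view(-1), ignore_index=-100)
    # (2) distillation: KL divergence between softened student and teacher distributions
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1).view(-1, vocab),
                  F.softmax(teacher_logits / temperature, dim=-1).view(-1, vocab),
                  reduction="batchmean") * temperature ** 2
    # (3) cosine-distance loss aligning student and teacher hidden states
    cos = 1 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean()
    return mlm + kd + cos  # the paper weights the three terms; equal weights here


B, T, V, H = 2, 8, 30_000, 768
loss = distil_loss(torch.randn(B, T, V), torch.randn(B, T, V),
                   torch.randint(0, V, (B, T)),   # pretend every position is masked
                   torch.randn(B, T, H), torch.randn(B, T, H))
```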
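
A minimal sketch of T5-style span corruption: sampled spans are replaced with sentinel tokens in the input, and the target lists each sentinel followed by the tokens it hid. The span sampling below (independent per-token drops grouped into runs) is a simplification of T5's actual procedure.

```python
import random


def span_corrupt(tokens, drop_prob=0.15, seed=0):
    """Return (corrupted_input, target) for a T5-style denoising example."""
    rng = random.Random(seed)
    drop = [rng.random() < drop_prob for _ in tokens]
    inputs, targets, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        if not drop[i]:
            inputs.append(tokens[i])          # kept tokens pass through unchanged
            i += 1
            continue
        marker = f"<extra_id_{sentinel}>"     # sentinel replaces the whole dropped span
        sentinel += 1
        inputs.append(marker)
        targets.append(marker)
        while i < len(tokens) and drop[i]:    # consume the full run of dropped tokens
            targets.append(tokens[i])
            i += 1
    return inputs, targets


src = "Thank you for inviting me to your party last week .".split()
print(span_corrupt(src))
```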
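
A minimal PyTorch sketch of the bottleneck adapter that Adapter-BERT inserts into each layer (project down, non-linearity, project up, residual connection); the hidden and bottleneck sizes are illustrative. During fine-tuning only the adapters (plus LayerNorms and the classifier) are trained while the pre-trained BERT weights stay frozen.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, hidden_size)    # project back up
        self.act = nn.GELU()

    def forward(self, x):
        # the residual connection keeps the adapter near-identity at initialization
        return x + self.up(self.act(self.down(x)))


adapter = Adapter()
out = adapter(torch.randn(2, 128, 768))   # (batch, seq_len, hidden)
```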
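
A minimal PyTorch sketch of CLIP's symmetric contrastive loss: cosine similarities between every image/text pair in the batch are scaled by a learnable temperature, and cross-entropy pulls each image toward its own caption (and vice versa). The random feature tensors stand in for the two encoders.

```python
import torch
import torch.nn.functional as F


def clip_loss(image_features, text_features, logit_scale):
    image_features = F.normalize(image_features, dim=-1)   # unit-norm embeddings
    text_features = F.normalize(text_features, dim=-1)
    logits = logit_scale.exp() * image_features @ text_features.t()
    targets = torch.arange(logits.size(0))                 # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) +             # image -> text direction
            F.cross_entropy(logits.t(), targets)) / 2      # text -> image direction


img = torch.randn(32, 512)                        # placeholder image-encoder outputs
txt = torch.randn(32, 512)                        # placeholder text-encoder outputs
scale = torch.nn.Parameter(torch.tensor(2.659))   # learnable log-temperature
loss = clip_loss(img, txt, scale)
```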