Model Name | Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training | Pre-Train Data | Batch Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BERT | Encoder-Only | Masked Language Modeling (MLM) + Next Sentence Prediction (NSP) | WordPiece | 30k tokens | Greedy decomposition of a word into sub-words until every piece is in the vocabulary (see sketch below the table). | Sum of: token embeddings (WordPiece) + segment embeddings (learned) + absolute position embeddings (see sketch below the table) | Scaled dot-product self-attention (note: it is advised to pad inputs on the right rather than the left since the positional embeddings are absolute) | GeLU, which avoids the dying-ReLU problem (a ReLU node can get stuck at 0 for negative inputs, stop learning, and never recover) | BERT_base: 110M; BERT_large: 340M | Adam with L2 weight decay, LR warm-up over the first 10k steps then linear decay, ~1M steps | BookCorpus and Wikipedia (~16 GB uncompressed) | 256 batch size, maximum length 512 |
DistilBERT | Encoder-Only (distilled version of BERT) | Triplet loss: (1) MLM + (2) distillation (soft teacher targets) + (3) cosine-distance between hidden states; no NSP (see sketch below the table) | Same as BERT | Same as BERT | Same as BERT | Embeddings are similar to BERT, except the segment embeddings are removed | Same as BERT | Same as BERT | 66M parameters | Same as BERT (I think) | Same as BERT | 4096 batch size |
ALBERT | Encoder-Only | MLM (with n-gram masking) + Sentence Order Prediction (SOP) instead of NSP | SentencePiece (as opposed to BERT's WordPiece), similar to XLNet | ~30k | Same as XLNet (greedy algorithm) | SentencePiece token embeddings, factorized: embed_size=128 is projected up to hidden_size (see parameter-count sketch below the table) | Encoder-only self-attention; the MLM masking probabilities differ from BERT's (n-gram masking) | GeLU (same as BERT) | ALBERT-base (sharing attention layers): 12 layers, hidden_size=768, embed_size=128 --> 64M parameters | LAMB optimizer (to support the large batch size), 125k training steps | Same as BERT | 4096 batch size |
RoBERTa | Encoder-Only | Masked Language Model objective with dynamic masking (see sketch below the table); no NSP or SOP (removing NSP was shown to be better) | Byte-level BPE (like GPT-2) | 50k | Byte-level BPE can encode any string, so there is effectively no OOV | Same as BERT | Same as BERT | Same as BERT | Model architecture is kept fixed to BERT_base / BERT_large (~125M / ~355M parameters with the larger vocabulary) | Pre-training is increased from 100k steps (roughly BERT-equivalent compute) up to 500k steps, with a tweak to the Adam hyper-parameters (beta_2 = 0.98 for large batches) | They combine 5 datasets (BookCorpus + Wikipedia, CC-News, OpenWebText, Stories) for ~160 GB of text | 8k batch size, max sequence length 512 (sometimes less due to the sampling technique) |
BART | Encoder-Decoder (Transformer) | Reconstruction loss: the usual cross-entropy between the decoder output and the original document (the encoder sees a corrupted document); they examine several corruption variants: token masking, token deletion, text infilling, sentence permutation, and document rotation | Same BPE encoding as GPT-2 | Same as GPT-2 / RoBERTa (~50k) | Same as GPT-2 / RoBERTa (byte-level BPE, effectively no OOV) | Same as GPT-2 / RoBERTa (token + learned position embeddings) | Same as the original Transformer | GeLU | BART_base: 6 encoder + 6 decoder layers (~140M); BART_large: 12 + 12 (~400M) | Pre-trained for 500k steps, otherwise following the RoBERTa recipe (Liu et al. 2019) | 160 GB of data, similar to Liu et al. 2019 | (for the large-scale experiments) batch_size = 8K |
T5 | Encoder-Decoder | BERT-style denoising (span-corruption) objective: similar to MLM, the model is trained to predict missing or corrupted tokens in the input; 15% of tokens are randomly sampled and dropped out, and each dropped span is replaced by a sentinel token (see sketch below the table). (Note: they experimented with many variants) | SentencePiece | 32k (across many languages with a 10:1 English-to-non-English ratio) | Same as BERT | Just token embeddings (relative position biases inside attention replace positional embeddings) | Self-attention + encoder-decoder attention (per layer) | ReLU | This study looks at many variants, but the Base model is similar to BERT_base: encoder and decoder are each roughly BERT_base-sized, ~220M parameters total (variants range from Small, 60M, up to 11B) | AdaFactor optimizer, inverse-square-root LR schedule, 2^19 (~524k) pre-training steps | Common Crawl's C4 dataset (~750 GB after cleaning, filtered from ~20 TB of raw Common Crawl) | T=512, batch = 128 sequences with packing such that each batch is approximately 65k tokens (much smaller than other studies) |
Adapter-BERT | Encoder-Only | Same as BERT (only fine-tuning happens in this paper) | Same as BERT | Same as BERT | Same as BERT | Same as BERT | Same as BERT | Same as BERT | Pre-trained BERT + adapter layers | Fine-tuning procedure: Adam with the LR increased during the first 10% of steps, then decayed to 0 during the last 90%; the LR is swept, as well as the adapter size | Same as BERT | Batch size = 32 |
ByT5 | Encoder-Decoder | Similar to span corruption (T5's pre-training objective) | Token-free! (operates directly on UTF-8 bytes; see sketch below the table) | 256 byte values + 3 special tokens | There is still an OOV token, but it is never needed | Only 256 token embeddings, no positional embeddings | Self-attention + encoder-decoder attention | GeGLU (follows mT5 / T5.1.1) | Model sizes were matched to mT5 (Small, Base, Large, XL, XXL); to compensate for the tiny vocabulary, the encoder depth ("heavy encoder") and dim_model, dim_ffn were increased | All hyper-parameters are the same as mT5, except: sequences of 1024 bytes and a mean mask-span length of 20 bytes (vs. 3 tokens in mT5) | Same mC4 corpus as the mT5 model | T = 1024 bytes, batch_size = 2^20 tokens (1024 sequences x 1024 bytes, as in mT5) |
CLIP | Encoder-Only (2 Transformer-based encoders: text + image) | Symmetric cross-entropy loss over cosine similarities (contrastive: maximize similarity for matched text-image pairs, minimize it for mismatched pairs; see sketch below the table) | For the text encoder: BPE | For the text encoder: ~50k (49,152) | Same as GPT | Text and image features are linearly projected into a shared multi-modal embedding space | Used differently in the two encoders: masked (causal) self-attention in the text encoder, standard self-attention in the ViT image encoder | Linear projections into the shared embedding space | In the base text encoder, 63M parameters | Adam optimizer with weight decay, cosine LR schedule, learnable temperature | WebImageText dataset: 400M (text, image) pairs | 32,768 |
DALL-E | Decoder-Only (see the Attention column) | No separate pre-training/fine-tuning stages per se; the transformer autoregressively models the concatenated text + image token stream | BPE for text + dVAE image ("pixel") tokens | text = 16,384; image = 8,192 | Greedy? | Token (and position) embeddings | 3 types of attention: text-to-text (causal), image-to-text (image tokens attend to all text tokens), and image-to-image attention | GeLU? | Up to 12B parameters! | Training is broken into 2 stages: (1) dVAE (Gumbel-softmax relaxation), (2) transformer (cross-entropy) with 16-bit precision | Conceptual Captions + a proprietary dataset | per_gpu = 8, total_batch = 512 |
Codex | Decoder-Only (GPT) | The usual causal (autoregressive) GPT language-modeling objective | BPE | ~50k + additional whitespace-run tokens | GPT-3 construction | GPT-3 construction | GPT-3 construction | GPT-3 construction | The large model was 12B parameters | Training was similar to GPT-3: 175-step linear warm-up, cosine learning-rate decay; training covered 100B tokens using the Adam optimizer with weight decay | 54 million public repositories on GitHub were scraped; after a number of filters, the final dataset was 159 GB | ? |
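
The BERT row's OOV handling refers to WordPiece's greedy, longest-match-first decomposition into sub-words. Below is a minimal Python sketch of that idea; the tiny vocabulary, the `##` continuation prefix, and the name `wordpiece_tokenize` are illustrative assumptions, not BERT's actual implementation.

```python
# Minimal sketch of WordPiece-style greedy (longest-match-first) sub-word
# decomposition, as used for BERT's OOV handling. The tiny vocabulary below
# is illustrative only; real BERT ships a ~30k-token vocab file.
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur_piece = None
        # Greedily try the longest substring first, shrinking until a match.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces are prefixed with '##'
            if piece in vocab:
                cur_piece = piece
                break
            end -= 1
        if cur_piece is None:
            return [unk_token]  # no decomposition found -> the whole word is OOV
        pieces.append(cur_piece)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", vocab))    # ['play', '##ing']
```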
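
BERT's input representation is the sum of three embeddings (token + segment + learned absolute position), as noted in the Embeddings column. The PyTorch sketch below shows that summation; the class name, default sizes, and the LayerNorm placement are illustrative assumptions, not a drop-in reimplementation of BERT.

```python
import torch
import torch.nn as nn

# Sketch of BERT-style input embeddings: the model input is the SUM of
# token (WordPiece), segment, and learned absolute position embeddings.
# Sizes are illustrative (vocab=30522, hidden=768, max_len=512, 2 segments).
class BertEmbeddings(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.seg = nn.Embedding(n_segments, hidden)
        self.pos = nn.Embedding(max_len, hidden)   # learned absolute positions
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)
        return self.norm(x)  # (real BERT also applies dropout here)

emb = BertEmbeddings()
ids = torch.randint(0, 30522, (2, 16))   # batch of 2 sequences, length 16
segs = torch.zeros_like(ids)             # all sentence-A tokens
print(emb(ids, segs).shape)              # torch.Size([2, 16, 768])
```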
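
ALBERT's embed_size=128 vs. hidden_size=768 comes from its factorized embedding parameterization: a V x E embedding table followed by an E x H projection instead of a single V x H table. A back-of-the-envelope check using the table's numbers:

```python
# Back-of-the-envelope sketch of ALBERT's factorized embedding parameterization:
# instead of one V x H embedding matrix, use V x E plus an E x H projection
# (E << H). Numbers match the table's ALBERT-base configuration.
V, H, E = 30_000, 768, 128

bert_style_embeddings = V * H              # one big V x H matrix
albert_style_embeddings = V * E + E * H    # factorized: V x E, then E x H

print(f"BERT-style:   {bert_style_embeddings:,} params")    # 23,040,000
print(f"ALBERT-style: {albert_style_embeddings:,} params")  # 3,938,304
print(f"Saved:        {bert_style_embeddings - albert_style_embeddings:,}")
```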
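
DistilBERT's "triplet loss" combines MLM cross-entropy, a distillation term on the teacher's temperature-softened logits, and a cosine loss aligning student and teacher hidden states. The sketch below assumes PyTorch and omits details such as loss weighting and restricting distillation to masked positions; the name `distilbert_loss` and the temperature `T=2.0` are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

# Sketch of DistilBERT's triplet training loss: (1) the usual MLM cross-entropy,
# (2) a distillation loss (KL divergence between temperature-softened student
# and teacher logits), and (3) a cosine loss pulling student hidden states
# toward the teacher's. Tensors below are random stand-ins for model outputs.
def distilbert_loss(student_logits, teacher_logits, mlm_labels,
                    student_hidden, teacher_hidden, T=2.0):
    vocab = student_logits.size(-1)
    # (1) masked-language-modelling loss (labels = -100 where not masked)
    loss_mlm = F.cross_entropy(student_logits.view(-1, vocab),
                               mlm_labels.view(-1), ignore_index=-100)
    # (2) distillation: match the teacher's softened output distribution
    loss_distil = F.kl_div(F.log_softmax(student_logits / T, dim=-1).view(-1, vocab),
                           F.softmax(teacher_logits / T, dim=-1).view(-1, vocab),
                           reduction="batchmean") * T * T
    # (3) cosine embedding loss on the hidden states (target = +1 everywhere)
    target = torch.ones(student_hidden.size(0) * student_hidden.size(1))
    loss_cos = F.cosine_embedding_loss(student_hidden.view(-1, student_hidden.size(-1)),
                                       teacher_hidden.view(-1, teacher_hidden.size(-1)),
                                       target)
    return loss_mlm + loss_distil + loss_cos   # loss weights omitted for simplicity

B, L, V, H = 2, 8, 30_522, 768
s_logits, t_logits = torch.randn(B, L, V), torch.randn(B, L, V)
labels = torch.randint(0, V, (B, L)); labels[:, ::2] = -100   # only some positions masked
print(distilbert_loss(s_logits, t_logits, labels,
                      torch.randn(B, L, H), torch.randn(B, L, H)))
```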
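
RoBERTa's dynamic masking samples a fresh 15% mask every time a sequence is fed to the model, instead of fixing the mask once during preprocessing as original BERT did. A minimal sketch, with a placeholder `MASK_ID` and vocabulary size and the standard 80/10/10 replacement split:

```python
import random

# Sketch of RoBERTa-style *dynamic* masking: a new 15% mask is sampled on every
# pass over a sequence rather than being fixed at preprocessing time.
# MASK_ID and VOCAB_SIZE are illustrative placeholders.
MASK_ID, VOCAB_SIZE = 4, 50_000

def dynamic_mask(token_ids, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                       # 80%: [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels

# Each call (i.e. each epoch) sees a different masking of the same sentence.
sentence = [101, 7592, 2088, 2003, 2307, 102]
print(dynamic_mask(sentence))
print(dynamic_mask(sentence))
```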
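
T5's denoising objective drops ~15% of input tokens, replaces each contiguous dropped span with a sentinel token, and asks the decoder to emit the sentinels followed by the dropped tokens. The sketch below works on string tokens for readability and samples drops independently per token, which is a simplification of T5's actual span-sampling procedure.

```python
import random

# Sketch of T5's span-corruption denoising objective: dropped spans in the input
# are replaced by sentinels (<extra_id_0>, <extra_id_1>, ...), and the target
# reconstructs the dropped spans behind the same sentinels.
def span_corrupt(tokens, drop_prob=0.15):
    drop = [random.random() < drop_prob for _ in tokens]
    inp, tgt, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        if drop[i]:
            inp.append(f"<extra_id_{sentinel}>")
            tgt.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and drop[i]:  # consume the whole dropped span
                tgt.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inp.append(tokens[i])
            i += 1
    tgt.append("</s>")
    return inp, tgt

tokens = "Thank you for inviting me to your party last week".split()
print(span_corrupt(tokens))
# e.g. (['Thank', 'you', '<extra_id_0>', 'me', 'to', 'your', 'party', '<extra_id_1>', 'week'],
#       ['<extra_id_0>', 'for', 'inviting', '<extra_id_1>', 'last', '</s>'])
```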
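
ByT5's "tokenless" vocabulary is just the 256 possible UTF-8 byte values plus 3 special tokens. Below is a minimal sketch of such a byte-level encoder/decoder; the particular special-token ids and the +3 offset follow the usual T5-style convention but are assumptions here, not pulled from the ByT5 code.

```python
# Sketch of ByT5's token-free input pipeline: text is mapped straight to its
# UTF-8 bytes, so the "vocabulary" is the 256 possible byte values plus
# 3 special ids (assumed here: 0=<pad>, 1=</s>, 2=<unk>).
SPECIAL_OFFSET = 3  # reserve ids 0..2 for the special tokens

def byt5_encode(text):
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [1]  # append </s>

def byt5_decode(ids):
    data = bytes(i - SPECIAL_OFFSET for i in ids if i >= SPECIAL_OFFSET)
    return data.decode("utf-8", errors="ignore")

ids = byt5_encode("héllo")   # non-ASCII characters simply take several bytes
print(ids)                   # [107, 198, 172, 111, 111, 114, 1]
print(byt5_decode(ids))      # 'héllo'
```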
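
CLIP's objective is a symmetric cross-entropy over the cosine-similarity matrix of a batch of image and text embeddings, scaled by a learnable temperature. A minimal PyTorch sketch under those assumptions (random vectors stand in for the encoders' outputs):

```python
import torch
import torch.nn.functional as F

# Sketch of CLIP's symmetric contrastive loss: cosine similarities between every
# (image, text) pair in a batch are scaled by a learnable temperature, then
# cross-entropy is taken along both axes so the N matched pairs on the diagonal
# are pulled together and the N^2 - N mismatched pairs are pushed apart.
def clip_loss(image_emb, text_emb, logit_scale):
    image_emb = F.normalize(image_emb, dim=-1)       # unit norm -> dot product = cosine
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * image_emb @ text_emb.t()  # (N, N) similarity matrix
    targets = torch.arange(image_emb.size(0))        # the i-th image matches the i-th text
    loss_img = F.cross_entropy(logits, targets)      # image -> text direction
    loss_txt = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_img + loss_txt) / 2

# Toy batch: random vectors stand in for the two encoders' outputs.
images, texts = torch.randn(8, 512), torch.randn(8, 512)
scale = torch.tensor(1 / 0.07)   # the paper initializes the learnable temperature to 0.07
print(clip_loss(images, texts, scale))
```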