Just helping myself keep track of LLM papers that I'm reading, with an emphasis on inference and model compression.
Transformer Architectures
- Attention Is All You Need
- Fast Transformer Decoding: One Write-Head is All You Need - Multi-Query Attention (sketch after this list)
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
- Augmenting Self-attention with Persistent Memory (Meta 2019)
- MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers (Meta 2023)
- Hyena Hierarchy: Towards Larger Convolutional Language Models
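
A minimal numpy sketch of the multi-query attention idea from "Fast Transformer Decoding: One Write-Head is All You Need": many query heads share a single key/value head, which shrinks the KV cache by a factor of the head count. Shapes and names are illustrative, and the causal mask is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, wq, wk, wv, n_heads):
    """Multi-query attention: n_heads query heads share one K/V head.

    x: (seq, d_model); wq: (d_model, n_heads * d_head); wk, wv: (d_model, d_head).
    Only the single shared head needs to be stored in the KV cache.
    """
    seq, _ = x.shape
    d_head = wk.shape[1]
    q = (x @ wq).reshape(seq, n_heads, d_head)          # per-head queries
    k, v = x @ wk, x @ wv                               # shared keys/values
    scores = np.einsum("qhd,kd->hqk", q, k) / np.sqrt(d_head)
    out = np.einsum("hqk,kd->qhd", softmax(scores), v)  # weighted sum of shared values
    return out.reshape(seq, n_heads * d_head)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 32))
out = multi_query_attention(x, rng.normal(size=(32, 32)),
                            rng.normal(size=(32, 8)), rng.normal(size=(32, 8)), n_heads=4)
print(out.shape)  # (5, 32)
```
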
Foundation Models
- LLaMA: Open and Efficient Foundation Language Models
- PaLM: Scaling Language Modeling with Pathways
- GPT-NeoX-20B: An Open-Source Autoregressive Language Model
- Language Models are Unsupervised Multitask Learners (OpenAI) - GPT-2
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
- OpenLLaMA: An Open Reproduction of LLaMA
- Llama 2: Open Foundation and Fine-Tuned Chat Models
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Position Encoding
- Self-Attention with Relative Position Representations
- RoFormer: Enhanced Transformer with Rotary Position Embedding - RoPE (sketch after this list)
- Transformer Language Models without Positional Encodings Still Learn Positional Information - NoPE
- Rectified Rotary Position Embeddings - ReRoPE
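
A minimal numpy sketch of rotary position embeddings (RoFormer): each pair of dimensions is rotated by a position-dependent angle, applied to queries and keys before the attention scores are computed, so their dot products depend only on the relative offset. The base constant follows the usual convention; the function is illustrative, not any library's implementation.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary position embedding for x of shape (seq, d), d even.

    Dimension pairs (2i, 2i+1) at position p are rotated by angle p * base**(-2i/d);
    apply to queries and keys before computing attention scores.
    """
    seq, d = x.shape
    theta = np.arange(seq)[:, None] * base ** (-np.arange(0, d, 2) / d)  # (seq, d/2)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```
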
KV Cache
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (Jun. 2023)
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
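
For context on what these papers optimize, here is a minimal sketch of the naive contiguous KV cache used in autoregressive decoding; PagedAttention (vLLM) swaps this contiguous buffer for fixed-size blocks behind a block table, and H2O evicts low-importance entries. The class and shapes are illustrative, not any library's API.

```python
import numpy as np

class KVCache:
    """Naive contiguous KV cache for a single attention head.

    Each decode step appends one key/value row, so attention only scores the
    new query against cached keys instead of re-running the whole prefix.
    """
    def __init__(self, max_seq, d_head):
        self.k = np.zeros((max_seq, d_head))
        self.v = np.zeros((max_seq, d_head))
        self.n = 0

    def append(self, k_t, v_t):
        self.k[self.n], self.v[self.n] = k_t, v_t
        self.n += 1

    def attend(self, q_t):
        k, v = self.k[:self.n], self.v[:self.n]
        scores = k @ q_t / np.sqrt(len(q_t))       # (n,)
        w = np.exp(scores - scores.max())
        return (w / w.sum()) @ v                   # (d_head,)
```
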
Activation
- Searching for Activation Functions
- GLU Variants Improve Transformer
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
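
A minimal sketch of a SwiGLU feed-forward block, combining the Swish/SiLU activation from "Searching for Activation Functions" with the gating from "GLU Variants Improve Transformer". Function names and shapes are illustrative.

```python
import numpy as np

def silu(x):
    """Swish/SiLU: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated FFN: (SiLU(x @ w_gate) * (x @ w_up)) @ w_down.

    x: (seq, d_model); w_gate, w_up: (d_model, d_ff); w_down: (d_ff, d_model).
    """
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```
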
Pruning
- Optimal Brain Damage (1990)
- Optimal Brain Surgeon (1993)
- Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning (Jan. 2023) - Introduces Optimal Brain Quantization, building on Optimal Brain Surgeon
- Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
- A Simple and Effective Pruning Approach for Large Language Models - Introduces Wanda (pruning with Weights and Activations)
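
A minimal sketch of the Wanda criterion just above: score each weight by its magnitude times the L2 norm of the corresponding input activation over a calibration batch, then zero the lowest-scored weights within each output row, with no retraining or weight update. Variable names and the per-row grouping are illustrative.

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """Prune weights W (d_out, d_in) using calibration activations X (n, d_in).

    score_ij = |W_ij| * ||X[:, j]||_2; the lowest-scored weights in each
    output row are set to zero.
    """
    scores = np.abs(W) * np.linalg.norm(X, axis=0)      # (d_out, d_in)
    k = int(W.shape[1] * sparsity)                      # weights to drop per row
    drop = np.argsort(scores, axis=1)[:, :k]            # lowest-scored columns per row
    W_pruned = W.copy()
    np.put_along_axis(W_pruned, drop, 0.0, axis=1)
    return W_pruned
```
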
Quantization
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale - Quantization with outlier handling (sketch after this list). Might be solving the wrong problem - see "Quantizable Transformers" below.
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models - Another approach to quantization with outliers
- Up or Down? Adaptive Rounding for Post-Training Quantization (Qualcomm 2020) - Introduces AdaRound
- Understanding and Overcoming the Challenges of Efficient Transformer Quantization (Qualcomm 2021)
- QuIP: 2-Bit Quantization of Large Language Models With Guarantees (Cornell Jul. 2023) - Introduces incoherence processing
- SqueezeLLM: Dense-and-Sparse Quantization (Berkeley Jun. 2023)
- Intriguing Properties of Quantization at Scale (Cohere May 2023)
- Pruning vs Quantization: Which is Better? (Qualcomm Jul. 2023)
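
A simplified sketch in the spirit of LLM.int8(): keep the input columns hit by activation outliers in full precision and absmax-quantize the rest to int8 per output row. The real method also quantizes activations and runs the int8 matmul directly; the threshold, names, and layout here are illustrative.

```python
import numpy as np

def quantize_with_outliers(W, X_calib, threshold=6.0):
    """Absmax int8 weight quantization with outlier columns kept in float.

    Input columns whose calibration activations exceed the threshold stay in
    full precision; the remaining columns are quantized per output row.
    """
    outlier_cols = np.any(np.abs(X_calib) > threshold, axis=0)          # (d_in,)
    W_reg = W[:, ~outlier_cols]
    scale = np.abs(W_reg).max(axis=1, keepdims=True) / 127.0 + 1e-12    # per-row absmax
    W_int8 = np.clip(np.round(W_reg / scale), -127, 127).astype(np.int8)
    return W_int8, scale, W[:, outlier_cols], outlier_cols

def matmul_mixed(x, W_int8, scale, W_outlier, outlier_cols):
    """y = x @ W.T with the int8 part dequantized and the outlier part in full precision."""
    y_reg = x[:, ~outlier_cols] @ (W_int8.astype(np.float64) * scale).T
    y_out = x[:, outlier_cols] @ W_outlier.T
    return y_reg + y_out
```
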
Normalization
- Root Mean Square Layer Normalization
- Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing - Introduces gated attention and argues that outliers are a consequence of normalization
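
RMSNorm in a few lines: normalize by the root mean square only, with no mean subtraction and no bias, then apply a learned gain.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: x / sqrt(mean(x**2) + eps) * gain, no mean-centering, no bias.

    x: (..., d); gain: (d,) learned scale.
    """
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain
```
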
Sparsity and rank compression
- Compressing Pre-trained Language Models by Decomposition - vanilla SVD decomposition to reduce matrix sizes (sketch after this list)
- Language model compression with weighted low-rank factorization - Fisher information-weighted SVD
- Numerical Optimizations for Weighted Low-rank Estimation on Language Model - Iterative implementation for the above
- Weighted Low-Rank Approximation (2003)
- Transformers learn through gradual rank increase
- Pixelated Butterfly: Simple and Efficient Sparse Training for Neural Network Models
- Scatterbrain: Unifying Sparse and Low-rank Attention Approximation
- LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation
- LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression
- KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation
- TRP: Trained Rank Pruning for Efficient Deep Neural Networks - Introduces energy-pruning ratio
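
A minimal sketch of the vanilla truncated-SVD factorization behind several of these papers: replace a weight matrix with two thinner rank-r factors, trading a small reconstruction error for fewer parameters. The Fisher-weighted variants change the reconstruction objective but keep this structure.

```python
import numpy as np

def low_rank_factorize(W, r):
    """Truncated SVD: W (m, n) ~ A (m, r) @ B (r, n).

    Parameter count drops from m*n to r*(m+n); the weighted variants minimize
    a Fisher-weighted reconstruction error instead of the plain Frobenius norm.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]          # fold singular values into the left factor
    B = Vt[:r]
    return A, B

W = np.random.default_rng(0).normal(size=(256, 1024))
A, B = low_rank_factorize(W, r=64)
print(A.shape, B.shape, np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```
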
Fine-tuning
- LoRA: Low-Rank Adaptation of Large Language Models (sketch after this list)
- QLoRA: Efficient Finetuning of Quantized LLMs
- DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation - works over a range of ranks
- Full Parameter Fine-tuning for Large Language Models with Limited Resources
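
A minimal sketch of a LoRA linear layer: the pretrained weight stays frozen and only a rank-r update B A, scaled by alpha / r, is trained; the adapter can be merged back into the weight for inference. QLoRA keeps the frozen weight in 4-bit and trains the same adapters. The class is illustrative, not any library's API.

```python
import numpy as np

class LoRALinear:
    """y = x @ W.T + (alpha / r) * x @ A.T @ B.T, with W frozen and only A, B trained."""

    def __init__(self, W, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                        # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
        self.B = np.zeros((d_out, r))                     # trainable, zero init: no change at start
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def merged_weight(self):
        """Fold the adapter into W for inference: W' = W + scale * B @ A."""
        return self.W + self.scale * self.B @ self.A
```
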
Sampling
- Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity
- Stay on topic with Classifier-Free Guidance
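
A minimal sketch of classifier-free guidance at decoding time, per "Stay on topic with Classifier-Free Guidance": run the model with and without the conditioning prompt and extrapolate from the unconditional logits toward the conditional ones before sampling. The temperature sampler is just for illustration.

```python
import numpy as np

def cfg_logits(logits_cond, logits_uncond, gamma=1.5):
    """Guided logits: uncond + gamma * (cond - uncond); gamma = 1 recovers cond."""
    return logits_uncond + gamma * (logits_cond - logits_uncond)

def sample(logits, temperature=0.8, rng=None):
    """Plain temperature sampling over the (guided) logits."""
    rng = rng or np.random.default_rng()
    z = logits / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)
```
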
Scaling
- Efficiently Scaling Transformer Inference (Google Nov. 2022) - Pipeline and tensor parallelization for inference
- Megatron-LM (Nvidia Mar. 2020) - Intra-layer parallelism for training
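
A minimal sketch of Megatron-style intra-layer (tensor) parallelism for an MLP block: the first weight matrix is split column-wise across devices and the second row-wise, so each shard computes locally and only the final sum (an all-reduce in practice) crosses devices. Plain array slices stand in for devices, and ReLU stands in for the usual GeLU.

```python
import numpy as np

def tensor_parallel_mlp(x, W1, W2, n_shards=2):
    """Megatron-style MLP split: W1 column-parallel, W2 row-parallel.

    x: (seq, d); W1: (d, d_ff); W2: (d_ff, d). Each shard applies its slice of
    W1, the elementwise activation, and its slice of W2; summing the partial
    outputs stands in for the single all-reduce per block.
    """
    partials = [np.maximum(x @ w1, 0.0) @ w2
                for w1, w2 in zip(np.split(W1, n_shards, axis=1),   # column-parallel
                                  np.split(W2, n_shards, axis=0))]  # row-parallel
    return sum(partials)

rng = np.random.default_rng(0)
x, W1, W2 = rng.normal(size=(3, 8)), rng.normal(size=(8, 32)), rng.normal(size=(32, 8))
assert np.allclose(tensor_parallel_mlp(x, W1, W2), np.maximum(x @ W1, 0.0) @ W2)
```
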
Mixture of Experts
- Adaptive Mixtures of Local Experts (1991, remastered PDF)
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Google 2017) - top-k gating (sketch after this list)
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (Google 2022)
- Go Wider Instead of Deeper
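
A minimal sketch of sparsely-gated top-k routing ("Outrageously Large Neural Networks"; Switch Transformers is the k = 1 case): a small gating network picks k experts per token and mixes their outputs with renormalized gate weights. Load-balancing losses and capacity limits are omitted; names and shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, W_gate, experts, k=2):
    """x: (seq, d); W_gate: (d, n_experts); experts: list of callables d -> d."""
    gate = softmax(x @ W_gate)                      # (seq, n_experts) routing probabilities
    topk = np.argsort(gate, axis=1)[:, -k:]         # indices of the k largest gates per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                     # route each token to its k experts
        w = gate[t, topk[t]]
        w = w / w.sum()                             # renormalize over the selected experts
        for wi, e in zip(w, topk[t]):
            out[t] += wi * experts[e](x[t])
    return out

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [(lambda W: (lambda v: v @ W))(rng.normal(size=(d, d))) for _ in range(n_experts)]
print(moe_layer(rng.normal(size=(5, d)), rng.normal(size=(d, n_experts)), experts).shape)  # (5, 8)
```
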
Watermarking
More