LeanTransformer implements a specific version of the transformer architecture with two goals in mind:
- using as little GPU memory as possible
- stable training for very large models
This code is under active development: if you want a stable and documented version, look at CALM or dalle-hivemind.
Basic usage: LeanTransformer works similarly to most models in Hugging Face Transformers. A model can be instantiated from a config, run forward and backward, and compute a loss. You can use the vanilla general-purpose LeanTransformer or one of the pre-implemented models:
```python
from transformers import AutoTokenizer
from lean_transformer.models.gpt import LeanGPTConfig, LeanGPTModel

config = LeanGPTConfig(
    vocab_size=10 ** 4, hidden_size=768, num_hidden_layers=12, num_attention_heads=16,
    position_embedding_type="rotary", hidden_act_gated=True, tie_word_embeddings=True
)
model = LeanGPTModel(config)

tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
dummy_inputs = tokenizer("A cat sat on a mat", return_tensors="pt")
outputs = model(**dummy_inputs, labels=dummy_inputs['input_ids'])
outputs.loss.backward()
```
All models are batch-first, i.e. they work on [batch, length, hid_size] or [batch, height, width, channels] tensors, like the rest of the Hugging Face ecosystem.
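For instance, continuing the usage example above (the exact sequence length in the comment is an assumption that depends on the tokenizer):

```python
# Tokenized inputs already follow the batch-first convention: dim 0 is the batch.
print(dummy_inputs["input_ids"].shape)  # e.g. torch.Size([1, 6]) -> [batch, length]
```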
A day will come when we explain all these modifications and provide instructions on how to tune them. Until then, we'll happily answer any questions on our Discord.
The core philosophy of LeanTransformer is to replace torch.autograd with grad students. Automatic differentiation is great if you want to test ideas quickly, less so if a single training run can cost over $4 million (or >1000 years in grad school). So, we made a ton of tweaks that minimize memory usage.
Related work: GSO
Our implementation partially replaces automatic differentiation with Grad Student Optimization (GSO), a biologically inspired black-box optimization algorithm. In the past, GSO has seen widespread adoption thanks to its strong theoretical foundations and unparalleled cost efficiency (Chom et al). Previous work successfully applied GSO to hyperparameter tuning and natural language generation. To the best of our knowledge, we are the first to apply distributed fault-tolerant GSO to optimizing the memory footprint of transformers. We summarize our findings below:
Memory-saving features:
- [default] manual memory-efficient differentiation for feedforward layers
- [option] gradient checkpointing (Griewank et al; Chen et al, 2016); see the sketch after this list
- [option] reversible layers using ClashLuke's revlib, based on (Gomez et al, 2017; Kitaev et al, 2020)
- [option] PixelFly block-sparse layers that significantly reduce the number of parameters (Chen et al, 2021)
- [option] customizable parameter sharing (Radford et al, 2019, Xue et al, 2021)
- [option] CPU-offloaded 8-bit LAMB (Dettmers et al, 2021)
- A pinch of magic that we'll explain eventually (hopefully)
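To illustrate what the gradient checkpointing option trades off, here is a minimal, self-contained sketch using plain `torch.utils.checkpoint`. It is not LeanTransformer's own implementation; `TinyBlock`, the layer count, and the tensor sizes are made up for the example.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class TinyBlock(nn.Module):
    """A stand-in residual feedforward block (hypothetical, for illustration only)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x):
        return x + self.ff(x)


blocks = nn.ModuleList([TinyBlock(768) for _ in range(12)])
x = torch.randn(2, 128, 768, requires_grad=True)  # [batch, length, hidden_size]

# With checkpointing, intermediate activations inside each block are discarded after
# the forward pass and recomputed during backward: extra compute for less memory.
# (use_reentrant=False needs a reasonably recent PyTorch.)
for block in blocks:
    x = checkpoint(block, x, use_reentrant=False)
x.sum().backward()
```

The sketch only shows the recompute-for-memory trade-off itself; consult the repository for how the option is actually enabled in LeanTransformer.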
Other features:
- [default] Pre-normalization: a more stable layer order used in GPT2 (as opposed to the original transformer)
- [option] Sandwich Norm, as proposed in (Ding et al, 2021)
- [option] Maintaining FP32 residuals in mixed precision training, learned from discussions with Samyam and Jeff from DeepSpeed
- [option] Rotary Position Embeddings, proposed by Su et al and popularized by EleutherAI
- [option] Gated activations, e.g. GeGLU (Shazeer, 2020), based on (Dauphin et al, 2016); see the sketch after this list
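As a concrete illustration of the gated-activation option, here is a minimal GeGLU feedforward sketch in plain PyTorch. The `GeGLUFeedForward` class, its sizes, and its layout are made up for the example and are not LeanTransformer's actual module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeGLUFeedForward(nn.Module):
    """Gated feedforward: GELU(x W_gate) * (x W_value), then a projection back.
    Hypothetical example module, not lean_transformer's implementation."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # A single matmul produces both the "value" and the "gate" halves.
        self.in_proj = nn.Linear(hidden_size, 2 * intermediate_size)
        self.out_proj = nn.Linear(intermediate_size, hidden_size)

    def forward(self, x):
        value, gate = self.in_proj(x).chunk(2, dim=-1)
        return self.out_proj(F.gelu(gate) * value)


layer = GeGLUFeedForward(hidden_size=768, intermediate_size=3072)
print(layer(torch.randn(2, 128, 768)).shape)  # torch.Size([2, 128, 768]) -> [batch, length, hidden_size]
```

The gate lets the network modulate the feedforward output elementwise; Shazeer (2020) reports that such GLU variants improve quality at a matched parameter budget.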
Acknowledgements:
- Most of the architecture and stability optimizations were systematized by the BigScience research workshop
- Hugging Face, whose transformers library this code builds on
- The YSDA community helped us survive the early, messy versions of this code
- The NeuroPark community trained the first practical model (SahajBERT-XL, SoTA in Bengali, details here)
- The LAION community helped us put together basic DALLE training
- NCAI, an Arabic community for training
- Personal thanks to Stas Bekman, Tim Dettmers, Lucas Nestler, Samyam Rajbhandari, Deepak Narayanan, Jared Casper, Jeff Rasley, as well as all the people who contributed to the code.