
🚧 PipeGoose: Training any 🤗 transformers in Megatron-LM 3D parallelism out of the box


[pipeline diagram]

Honk honk honk! This project is actively under development. Check out my learning progress here.

⚠️ The project is actively under development and not ready for use.

⚠️ The APIs are still a work in progress and could change at any time. None of the public APIs are set in stone until we hit version 0.6.9.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset
+ from pipegoose import DataParallel, TensorParallel, PipelineParallel, ParallelContext
+ from pipegoose.optim import DistributedOptimizer

model = AutoModel.from_pretrained("bloom")
tokenizer = AutoTokenizer.from_pretrained("bloom")

- device = "cuda"
- model = model.to(device)
+ parallel_context = ParallelContext(
+    tensor_parallel_size=2,
+    data_parallel_size=2,
+    pipeline_parallel_size=2
+ )
+ model = DataParallel(model, parallel_context).parallelize()
+ model = TensorParallel(model, parallel_context).parallelize()
+ model = PipelineParallel(model, parallel_context).parallelize()

optimizer = torch.optim.Adam(model.parameters())
+ optimizer = DistributedOptimizer(optimizer, parallel_context)

dataset = load_dataset('goose')
dataloader = torch.utils.data.DataLoader(dataset, batch_size=42)

for epoch in range(69):
    for inputs, targets in dataloader:
-         inputs = inputs.to(device)
-         targets = targets.to(device)

        output = model(inputs)
        loss = F.cross_entropy(output, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

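Note that the parallel sizes multiply: with tensor_parallel_size=2, data_parallel_size=2, and pipeline_parallel_size=2, the example above needs 2 × 2 × 2 = 8 processes (typically one per GPU); how they are launched depends on your setup, e.g. a torch.distributed launcher such as torchrun.
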
Implementation Details

  • Supports training 🤗 transformers models in Megatron-LM 3D parallelism and ZeRO-1 (written from scratch).
  • Implements parallel compute and data transfer using separate CUDA streams.
  • Gradient checkpointing will be implemented by enforcing a virtual dependency in the backpropagation graph, ensuring that the activations for each gradient checkpoint are recomputed just in time for each (micro-batch, partition).
  • Custom algorithms for model partitioning, with two default partitioning schemes based on elapsed time and GPU memory consumption per layer (a rough sketch follows this list).
  • Potential support includes:
    • Callbacks within the pipeline: Callback(function, microbatch_idx, partition_idx) for before and after the forward, backward, and recompute steps (for gradient checkpointing); see the callback sketch after this list.
    • Mixed precision training.
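
As a rough illustration of the cost-based partitioning above, here is a minimal sketch (under assumed inputs, not pipegoose's implementation): profile a per-layer cost such as forward time or GPU memory, then greedily split the layers into contiguous groups of roughly equal total cost, one group per pipeline stage.

# Minimal sketch of cost-based partitioning, not pipegoose's implementation.
# costs[i] is a profiled per-layer cost (e.g. forward time in ms or peak GPU memory in MB).
def partition_by_cost(costs, num_partitions):
    target = sum(costs) / num_partitions
    partitions, current, current_cost = [], [], 0.0
    for idx, cost in enumerate(costs):
        current.append(idx)
        current_cost += cost
        layers_left = len(costs) - idx - 1
        groups_left = num_partitions - len(partitions) - 1
        # Close the current group once it reaches the target cost, as long as
        # enough layers remain to give every remaining group at least one layer.
        if current_cost >= target and groups_left > 0 and layers_left >= groups_left:
            partitions.append(current)
            current, current_cost = [], 0.0
    partitions.append(current)
    return partitions

partition_by_cost([4.0, 1.0, 1.0, 2.0, 3.0, 1.0], num_partitions=3)
# => [[0], [1, 2, 3], [4, 5]]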
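
Similarly, a minimal sketch of the pipeline-callback idea; the class and hook names below are illustrative assumptions, not pipegoose's API.

# Hypothetical callback with hooks invoked around each (micro-batch, partition) step.
class LoggingCallback:
    def before_forward(self, microbatch_idx, partition_idx):
        print(f"forward start: microbatch={microbatch_idx}, partition={partition_idx}")

    def after_forward(self, microbatch_idx, partition_idx):
        print(f"forward done:  microbatch={microbatch_idx}, partition={partition_idx}")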

Appreciation

  • Big thanks to 🤗 Hugging Face for sponsoring this project with 8x A100 GPUs for testing, and to Zach Schrier for the monthly Twitch donations!

  • The library's APIs are inspired by OSLO's and ColossalAI's APIs.
