
🚧 PipeGoose: Training any 🤗 transformers in Megatron-LM 3D parallelism out of the box


[pipeline diagram]

Honk honk honk! This project is actively under development. Check out my learning progress here.

⚠️ The project is actively under development and not ready for use.

⚠️ The APIs are still a work in progress and could change at any time. None of the public APIs are set in stone until we hit version 0.6.9.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset
+ from pipegoose import DataParallel, TensorParallel, PipelineParallel, ParallelContext
+ from pipegoose.optim import DistributedOptimizer

model = AutoModel.from_pretrained("bloom")
tokenizer = AutoTokenizer.from_pretrained("bloom")

- device = "cuda"
- model = model.to(device)
+ parallel_context = ParallelContext(
+    tensor_parallel_size=2,
+    data_parallel_size=2,
+    pipeline_parallel_size=2
+ )
+ model = DataParallel(model, parallel_context).parallelize()
+ model = TensorParallel(model, parallel_context).parallelize()
+ model = PipelineParallel(model, parallel_context).parallelize()

optimizer = torch.optim.Adam(model.parameters())
+ optimizer = DistributedOptimizer(optimizer, parallel_context)

dataset = load_dataset('goose')
dataloader = torch.utils.data.DataLoader(dataset, batch_size=42)

for epoch in range(69):
    for inputs, targets in dataloader:
-         inputs = inputs.to(device)
-         targets = targets.to(device)

        output = model(inputs)
        loss = F.cross_entropy(output, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

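Note that the parallel sizes multiply: with tensor_parallel_size=2, data_parallel_size=2, and pipeline_parallel_size=2, the example above needs 2 × 2 × 2 = 8 processes (typically one per GPU); how they are launched depends on your setup, e.g. a torch.distributed launcher such as torchrun.
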
Implementation Details

  • Supports training 🤗 transformers models in Megatron-LM 3D parallelism and ZeRO-1 (written from scratch).
  • Implements parallel compute and data transfer using separate CUDA streams.
  • Gradient checkpointing will be implemented by enforcing a virtual dependency in the backpropagation graph, ensuring that the activations for each gradient checkpoint are recomputed just in time for each (micro-batch, partition).
  • Custom algorithms for model partitioning, with two default partitioning schemes based on elapsed time and GPU memory consumption per layer (a rough sketch follows this list).
  • Potential support includes:
    • Callbacks within the pipeline: Callback(function, microbatch_idx, partition_idx) for before and after the forward, backward, and recompute steps (for gradient checkpointing); see the callback sketch after this list.
    • Mixed precision training.
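
As a rough illustration of the cost-based partitioning above, here is a minimal sketch (under assumed inputs, not pipegoose's implementation): profile a per-layer cost such as forward time or GPU memory, then greedily split the layers into contiguous groups of roughly equal total cost, one group per pipeline stage.

# Minimal sketch of cost-based partitioning, not pipegoose's implementation.
# costs[i] is a profiled per-layer cost (e.g. forward time in ms or peak GPU memory in MB).
def partition_by_cost(costs, num_partitions):
    target = sum(costs) / num_partitions
    partitions, current, current_cost = [], [], 0.0
    for idx, cost in enumerate(costs):
        current.append(idx)
        current_cost += cost
        layers_left = len(costs) - idx - 1
        groups_left = num_partitions - len(partitions) - 1
        # Close the current group once it reaches the target cost, as long as
        # enough layers remain to give every remaining group at least one layer.
        if current_cost >= target and groups_left > 0 and layers_left >= groups_left:
            partitions.append(current)
            current, current_cost = [], 0.0
    partitions.append(current)
    return partitions

partition_by_cost([4.0, 1.0, 1.0, 2.0, 3.0, 1.0], num_partitions=3)
# => [[0], [1, 2, 3], [4, 5]]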
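
Similarly, a minimal sketch of the pipeline-callback idea; the class and hook names below are illustrative assumptions, not pipegoose's API.

# Hypothetical callback with hooks invoked around each (micro-batch, partition) step.
class LoggingCallback:
    def before_forward(self, microbatch_idx, partition_idx):
        print(f"forward start: microbatch={microbatch_idx}, partition={partition_idx}")

    def after_forward(self, microbatch_idx, partition_idx):
        print(f"forward done:  microbatch={microbatch_idx}, partition={partition_idx}")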

Appreciation

  • Big thanks to 🤗 Hugging Face for sponsoring this project with 8x A100 GPUs for testing, and to Zach Schrier for the monthly Twitch donations!

  • The library's APIs are inspired by OSLO's and ColossalAI's APIs.
