Feature Description
The original PR implemented tensor parallelism for the T5 model incorrectly. This PR adds the missing feature.
Details
The goal is to ensure that the "model parallel" wrapper classes (`ColumnParallelLinear`, `RowParallelLinear`) around typical transformers operations work correctly, as described in the original paper: https://arxiv.org/pdf/1909.08053.pdf. This PR makes sure that we are:

- sharding the model so that the `MLP` and `Attention` blocks are distributed across GPUs (a sketch of the wiring is shown after this list),
- using the `ColumnParallelLinear` and `RowParallelLinear` modules in the `Attention` module (the modules take care of calculating the correct sharded tensor dimensions for us, and this PR takes advantage of that),
- switching `nn.Embedding` (non-parallelizable) to `VocabParallelEmbedding`, which allows the embedding GEMM to be parallelized.
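For concreteness, here is a minimal sketch of how the sharded attention and embedding could be wired up. The import path, class names, and constructor keywords (`gather_output`, `input_is_parallel`) follow the usual Megatron/vLLM-style conventions and are assumptions; they may not match this PR's code exactly.

```python
import torch.nn as nn

# NOTE: import path and constructor keywords are assumed (Megatron-style);
# the actual modules in this repo may differ.
from vllm.model_executor.parallel_utils.tensor_parallel import (
    ColumnParallelLinear, RowParallelLinear, VocabParallelEmbedding)


class ShardedT5Attention(nn.Module):
    """Illustrative wiring only, not the PR's actual T5 attention class."""

    def __init__(self, d_model: int, num_heads: int, d_kv: int):
        super().__init__()
        inner_dim = num_heads * d_kv
        # Q/K/V: column-parallel, i.e. the output (head) dimension is sharded.
        # gather_output=False keeps each rank working on its local heads only.
        self.q = ColumnParallelLinear(d_model, inner_dim, bias=False, gather_output=False)
        self.k = ColumnParallelLinear(d_model, inner_dim, bias=False, gather_output=False)
        self.v = ColumnParallelLinear(d_model, inner_dim, bias=False, gather_output=False)
        # Output projection: row-parallel, i.e. the input dimension is sharded;
        # the module all-reduces the partial results across ranks.
        self.o = RowParallelLinear(inner_dim, d_model, bias=False, input_is_parallel=True)


class ShardedT5Embedding(nn.Module):
    """nn.Embedding replaced by its vocab-parallel counterpart."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed_tokens = VocabParallelEmbedding(vocab_size, d_model)
```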
Checkpoint weights are loaded by:

a) taking an original checkpoint weight matrix, splitting it into `num_shards` parts, and sending them off to the appropriate devices. E.g. assuming four shards and column-wise sharding, if the matrix is `[N, M]`, we create four `[N/4, M]` shards and send them to the four model copies.

b) for the relative positional embeddings, if sharding is enabled, loading only the small chunk of the original matrix whose dimensions match the sharded hidden dimension.
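The following is a single-process sketch of the two loading rules above; `num_shards`, `rank`, and the tensor shapes are illustrative placeholders, not the PR's actual loader code.

```python
import torch

num_shards, rank = 4, 0            # rank would come from the distributed runtime
n, m, num_heads, rel_buckets = 1024, 512, 8, 32

# (a) column-wise sharding: split the original [N, M] checkpoint matrix along
#     the output dimension into num_shards [N/4, M] pieces and keep only the
#     piece that belongs to this rank.
checkpoint_weight = torch.randn(n, m)
shard = torch.chunk(checkpoint_weight, num_shards, dim=0)[rank]
assert shard.shape == (n // num_shards, m)

# (b) relative positional embeddings: the T5 bias table is [num_buckets, num_heads];
#     with sharding enabled, load only the columns for this rank's heads so the
#     table matches the sharded (per-rank) head dimension.
rel_pos_bias = torch.randn(rel_buckets, num_heads)
heads_per_shard = num_heads // num_shards
rel_shard = rel_pos_bias[:, rank * heads_per_shard:(rank + 1) * heads_per_shard]
assert rel_shard.shape == (rel_buckets, heads_per_shard)
```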
Testing
Made sure that the output of the model is correct not only for an unsharded model but also for models distributed across two and four shards (see `examples/offline_inference_enc_dec.py`).
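A hypothetical version of that check, assuming a vLLM-style offline `LLM`/`SamplingParams` API; the model name and prompt are placeholders, and the actual comparison is done in `examples/offline_inference_enc_dec.py`.

```python
from vllm import LLM, SamplingParams

prompts = ["translate English to German: The house is wonderful."]
params = SamplingParams(temperature=0.0, max_tokens=32)  # greedy, so outputs are deterministic

# Run once per configuration (tensor_parallel_size in {1, 2, 4}) and compare
# the generated text against the unsharded reference run.
llm = LLM(model="t5-small", tensor_parallel_size=2)
outputs = [out.outputs[0].text for out in llm.generate(prompts, params)]
print(outputs)
```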