[Model Loading] Speedup model loading with distributed loading #3729
Conversation
Hello! Thanks for the PR; I left some comments & questions. This is indeed a neat feature, but please consider the use of utils & helper modules.
IMO, ideally we would like to reduce the need to change each model file as much as possible, to make the codebase easier to maintain.
@ywang96 Thank you for your suggestion! I have moved the duplicated code into utils as per your advice, which resulted in minimal changes to each model file and made them clearer. Additionally, I have added the cli arg
Hi guys, this is a truly valuable feature. Is this still moving forward into an official vLLM release? Not trying to be pushy; great work and a cool concept!
cc/ @sdake
Closing as this has become stale. Please see #6127 (comment): we recommend using the safetensors format, and then we don't need this optimization.
Hello! The current method for model loading is quite fixed, regardless of the tensor parallel size: each rank in a TP group reads the full weight file and then discards the excess weight tensors if only a portion of the parameters is needed for that rank. When `--tensor-parallel-size` is greater than 1, most parameters require only 1/tp_size of the weights, leading to significant extra weight IO.

Observing that disk IO is slow (particularly for bin files) while the transfer rate between GPUs is fast, we can adopt a distributed loading approach: each worker loads only 1/tp_size of the weight file (by file division, or, for safetensors files, by tensor division), and the parameters needed by the workers are then transferred to each other using `torch.distributed.scatter` or `torch.distributed.broadcast`. This approach can reduce disk IO to 1/tp_size.

I have implemented example distributed loading code in `llama.py` and `baichuan.py`; I believe other models (if needed) can easily implement similar logic. To ensure compatibility with previous code, the args introduced in this PR are optional, so if you do not wish to use distributed loading, the original code does not require any modifications.

When `--tensor-parallel-size >= 4`, the distributed loading method can significantly accelerate loading times, typically by 40% or more. Here are the experiment results on my machine (8*A100) for Llama-2-70b and Baichuan2-13B.
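For illustration, here is a minimal sketch of the idea under stated assumptions: a process group (e.g. NCCL) is already initialized, the checkpoint has been pre-split into one shard per rank, and the names `load_weights_distributed` and `shard_paths` are hypothetical, not the PR's actual API. It shows only the `torch.distributed.broadcast` path; the PR also mentions `torch.distributed.scatter` for tensors that stay sharded.

```python
# Hypothetical sketch of distributed weight loading, not the PR's actual code.
# Each rank reads only its own shard of the checkpoint from disk, then tensors
# are exchanged over the fast GPU interconnect so every rank gets what it needs.
import torch
import torch.distributed as dist


def load_weights_distributed(shard_paths, device="cuda"):
    """shard_paths: one checkpoint file per rank (file division)."""
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # 1) Each rank reads only 1/world_size of the checkpoint files from disk.
    local_shard = torch.load(shard_paths[rank], map_location="cpu")

    # 2) Every tensor is broadcast from the rank that read it. Non-owning ranks
    #    allocate an empty receive buffer of the matching shape/dtype first.
    full_state = {}
    for owner in range(world_size):
        # The owner shares its shard's metadata so others can allocate buffers.
        meta = [[(k, v.shape, v.dtype) for k, v in local_shard.items()]
                if owner == rank else None]
        dist.broadcast_object_list(meta, src=owner)
        for name, shape, dtype in meta[0]:
            if owner == rank:
                tensor = local_shard[name].to(device)
            else:
                tensor = torch.empty(shape, dtype=dtype, device=device)
            dist.broadcast(tensor, src=owner)
            full_state[name] = tensor
    return full_state
```

In the approach described in this PR, each rank would keep only the 1/tp_size slice it actually needs after the exchange, rather than the full state dict shown in this sketch.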