Accelerate LLaMA model loading #234
Conversation
Oh sorry, didn't mean to do that. :P
@AlpinDale Can we merge this? Currently, model loading is extremely slow.
@JF-D Could you please add some comments to your changes? A tad hard to read them at the moment 😬
Resolve conflicts for reference. |
@JF-D Sorry for the long delay. A lot of people have actually asked for safetensor support, and your PR looks great for LLaMA models! Do you think there is a possibility to extend what you did to all models by just modifying the `hf_model_weights_iterator` function?
@zhuohan123 I think it's possible, and I've updated the `hf_model_weights_iterator` function accordingly.
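For readers following along, here is a minimal sketch of what a safetensors-aware weight iterator in the spirit of `hf_model_weights_iterator` could look like. The function name and the file-discovery logic are illustrative assumptions, not the exact vLLM implementation:

```python
import glob
import os
from typing import Any, Iterator, Tuple

from safetensors import safe_open


def safetensors_weights_iterator(model_dir: str) -> Iterator[Tuple[str, Any]]:
    """Lazily yield (name, weight) pairs from *.safetensors shards in model_dir.

    Each yielded weight is a lazy slice object; the caller materializes it
    (fully with `[:]`, or partially with a sub-range) while the file is still
    open, i.e. inside the consuming loop of the model's load_weights.
    """
    shard_files = sorted(glob.glob(os.path.join(model_dir, "*.safetensors")))
    for shard_file in shard_files:
        with safe_open(shard_file, framework="pt", device="cpu") as f:
            for name in f.keys():
                # get_slice does not read any tensor data yet; indexing does.
                yield name, f.get_slice(name)
```

The key difference from `torch.load` on pickled .bin shards is that nothing is read from disk until a slice is indexed, so each tensor-parallel rank can read only its own shard of every weight.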
Again, thank you for your great contribution! We tested the code last week and it works great. I left some questions and small suggestions; once they are addressed, this PR should be ready to merge. BTW, you can run `format.sh --all` to format your changes.
vllm/model_executor/models/llama.py (Outdated)
if "embed_tokens" in name or "lm_head" in name: | ||
param = state_dict[name] | ||
# Consider padding in the vocab size. | ||
padded_vocab_size = param.shape[0] * tp_size | ||
if padded_vocab_size > self.config.vocab_size: | ||
load_padded_tensor_parallel_vocab(param, loaded_weight, name, | ||
self._column_parallel_weights, | ||
self._row_parallel_weights, | ||
tensor_model_parallel_rank) | ||
continue |
Is this part a must-have change for safetensors, or is it another optimization? If the latter, maybe we can move it into another PR and keep this PR purely about loading safetensors?
If we can assume that the vocab will not be padded, this is not a must-have change. Can we make that assumption here?
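To make the condition concrete, here is a tiny worked example with hypothetical shapes (a padded per-rank shard of 4032 rows versus an unpadded 32000-entry vocab):

```python
# Hypothetical shapes: each tensor-parallel rank holds a 4032-row embedding
# shard, while the checkpoint's embedding has 32000 rows in total.
tp_size = 8
param_rows = 4032                          # param.shape[0] on one rank
vocab_size = 32000                         # self.config.vocab_size

padded_vocab_size = param_rows * tp_size   # 32256
takes_padded_path = padded_vocab_size > vocab_size
print(padded_vocab_size, takes_padded_path)  # 32256 True
```

If the vocab is never padded, `param.shape[0] * tp_size` equals `vocab_size` exactly and this branch never runs, which is what the question above is probing.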
vllm/model_executor/weight_utils.py (Outdated)
```python
def load_padded_tensor_parallel_vocab(
    param: torch.Tensor,
    loaded_weight: torch.Tensor or object,
    param_name: str,
    column_parallel_weight_names: List[str],
    row_parallel_weight_names: List[str],
    tensor_model_parallel_rank: int,
) -> None:
    for p in column_parallel_weight_names:
        if p in param_name:
            shard_size = param.shape[0]
            start_idx = tensor_model_parallel_rank * shard_size
            end_idx = (tensor_model_parallel_rank + 1) * shard_size
            loaded_weight = loaded_weight[start_idx:end_idx]
            break

    # Convert PySafeSlice object to torch.Tensor.
    if not isinstance(loaded_weight, torch.Tensor):
        loaded_weight = loaded_weight[:]

    param[:loaded_weight.shape[0]].copy_(loaded_weight)
```
ditto, we can exclude this function from this PR if it's not related to safetensors.
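For context on the PySafeSlice conversion in the snippet above: the slice objects come from safetensors' lazy-loading API, and indexing is what actually reads bytes from disk. A minimal sketch (file and tensor names, as well as the shard shape, are hypothetical):

```python
import torch
from safetensors import safe_open

with safe_open("model-00001-of-00002.safetensors", framework="pt", device="cpu") as f:
    lazy = f.get_slice("model.embed_tokens.weight")  # PySafeSlice: no data read yet
    tp_rank, shard_rows = 1, 4000                    # hypothetical tensor-parallel shard
    shard = lazy[tp_rank * shard_rows:(tp_rank + 1) * shard_rows]  # reads only this shard
    full = lazy[:]                                   # `[:]` materializes the whole tensor
    assert isinstance(shard, torch.Tensor) and isinstance(full, torch.Tensor)
```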
Thank you for your contribution again! I will merge this PR first, and then add safetensor loading for other models in another PR.
This PR accelerates LLaMA model weight loading with safetensors. I find that the current weight-loading implementation roughly doubles its time cost as the tensor-model-parallel size increases (see the LLaMA-65B loading-time table below).
I think it is ready for review.
Code adapted from https://github.com/huggingface/text-generation-inference/blob/v0.8.2/server/text_generation_server/models/flash_llama.py#L206
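The speedup comes from each tensor-parallel rank reading only its own slice of every weight instead of deserializing the full checkpoint. A hedged sketch of the column-parallel case, loosely modeled on the linked TGI code (the function name and sharding convention are illustrative):

```python
import torch
from safetensors import safe_open


def load_column_parallel_shard(path: str, name: str,
                               tp_rank: int, tp_size: int) -> torch.Tensor:
    """Read only rows [tp_rank*shard : (tp_rank+1)*shard] of `name` from a safetensors file."""
    with safe_open(path, framework="pt", device="cpu") as f:
        lazy = f.get_slice(name)
        total_rows = lazy.get_shape()[0]
        shard = total_rows // tp_size
        # Only this row range is read from disk and copied into host memory.
        return lazy[tp_rank * shard:(tp_rank + 1) * shard]
```

With the previous pickled .bin loading, every rank had to materialize each full tensor before slicing it, so total bytes read grow with the tensor-parallel size; with lazy slices they stay roughly constant, which matches the loading-time growth described above.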