LLM: Partial Prefilling for Pipeline Parallel Serving #11457

xiangyuT · 2024-06-28T01:44:52Z

Description

Add continuous-batching-like partial prefilling to reduce the memory peak during prefilling.

Below is an example for a BatchTask with batch_size=4 and max_prefilled_seqs=2:

Initially:
First partial prefilling:
Second partial prefilling:
Decoding:
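
The four stages above can be approximated with a small scheduling sketch. It is not the code from this PR: `BatchTask` here is a hypothetical stand-in that keeps only the `prefilled_index` field seen in the diff, `next_prefill_slice` and `schedule` are illustrative names, and treating a non-positive `max_prefilled_seqs` as "prefill the whole batch at once" is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BatchTask:
    # Hypothetical stand-in; only `prefilled_index` is taken from the PR diff.
    batch_size: int
    prefilled_index: int = 0  # number of sequences already prefilled


def next_prefill_slice(task: BatchTask, max_prefilled_seqs: int) -> Tuple[int, int]:
    """Return the [start, end) range of sequences to prefill in this step."""
    start = task.prefilled_index
    if max_prefilled_seqs > 0:
        # Prefill at most `max_prefilled_seqs` sequences per step.
        end = min(start + max_prefilled_seqs, task.batch_size)
    else:
        # Assumption: a non-positive limit disables partial prefilling.
        end = task.batch_size
    return start, end


def schedule(batch_size: int, max_prefilled_seqs: int) -> List[Tuple[str, int, int]]:
    """List the prefill steps, followed by one decode step over the full batch."""
    task = BatchTask(batch_size=batch_size)
    steps = []
    while task.prefilled_index < task.batch_size:
        start, end = next_prefill_slice(task, max_prefilled_seqs)
        steps.append(("prefill", start, end))
        task.prefilled_index = end
    steps.append(("decode", 0, task.batch_size))
    return steps


print(schedule(4, 2))
# [('prefill', 0, 2), ('prefill', 2, 4), ('decode', 0, 4)]
```

With batch_size=4 and max_prefilled_seqs=2, the sketch yields two partial prefill steps of two sequences each before decoding starts on all four, so no single step prefills more than two prompts.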

4. How to test?

Unit test: Please manually trigger the PR Validation here by inputting the PR number (e.g., 1234). And paste your action link here once it has been successfully finished.
https://github.com/intel-analytics/ipex-llm-workflow/actions/runs/9802540844
Local test

qiyuangong · 2024-07-02T03:09:52Z

python/llm/src/ipex_llm/transformers/pipeline_parallel.py

+    def prepare_batch(self, cur_batch):
+        if self.rank == 0:
+            cur_input_start = cur_batch.prefilled_index
+            if self.max_prefilled_seqs > 0:


Do we need to set cur_batch.partial_prefilling = 0 when max_prefilled_seqs == 0?

qiyuangong · 2024-07-02T03:11:00Z

python/llm/src/ipex_llm/transformers/pipeline_parallel.py

@@ -146,6 +146,8 @@ def pipeline_parallel(model, pipeline_parallel_stages):
            model._modules['lm_head'] = DummyLayer()

    model.pipeline_parallel_stages = pipeline_parallel_stages
+    model.layer_start = layer_start


Can we replace the global layer_start and layer_end with model.layer_start, etc.?

qiyuangong

LGTM


xiangyuT added 5 commits June 27, 2024 23:25

init

c157bcd

refine

bf788c3

add support for chatglm2/3

d1af961

fix

be7cc05

format

5627c4d

xiangyuT changed the title ~~[WIP] LLM: Partial Prefilling for Pipeline Parallel Serving~~ LLM: Partial Prefilling for Pipeline Parallel Serving Jul 2, 2024

xiangyuT requested review from glorysdj, hkvision, lalalapotter and plusbang July 2, 2024 01:37

qiyuangong reviewed Jul 2, 2024


xiangyuT added 6 commits July 3, 2024 14:28

refine

7ffeb75

try to merge main

6fe013f

refine

cf1e9e2

format

fdc87df

fix

7e74090

refine readme

6e74538

qiyuangong approved these changes Jul 5, 2024


xiangyuT merged commit 7d8bc83 into intel-analytics:main Jul 5, 2024
1 check passed

RyuKosei pushed a commit to RyuKosei/ipex-llm that referenced this pull request Jul 19, 2024

LLM: Partial Prefilling for Pipeline Parallel Serving (intel-analytic…

55d2177

…s#11457) LLM: Partial Prefilling for Pipeline Parallel Serving
