LLM: Partial Prefilling for Pipeline Parallel Serving (#11457)
xiangyuT authored Jul 5, 2024
1 parent 72b4efa commit 7d8bc83
Showing 4 changed files with 261 additions and 102 deletions.
6 changes: 5 additions & 1 deletion python/llm/example/GPU/Pipeline-Parallel-FastAPI/README.md
@@ -57,8 +57,12 @@ pip install trl==0.8.1
bash run.sh
```

> Note: INT4 optimization is applied to the model by default. You could specify other low bit optimizations (such as 'fp8' and 'fp6') through `--low-bit`. Besides, you could change `NUM_GPUS` to the number of GPUs you have on your machine.
### Command Line Arguments in `run.sh`
> Note: INT4 optimization is applied to the model by default. You could specify other low-bit optimizations (such as 'fp8' and 'fp6') through `--low-bit`. Besides, you could change `NUM_GPUS` to the number of GPUs you have on your machine. Other related settings are listed below:
- `--low-bit`: Sets the low-bit optimization (such as 'sym_int4', 'fp16', 'fp8' and 'fp6') for the model.
- `--max-num-seqs`: Sets the maximum batch size on a single card during pipeline parallel serving.
- `--max-prefilled-seqs`: Sets the maximum batch size for prefilled sequences. Use `0` to disable partial prefilling and process all requests in a single batch.
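
For a rough sense of how `--max-num-seqs` and `--max-prefilled-seqs` interact, here is a minimal, illustrative sketch based only on the descriptions above (not on the actual serving code): the number of prefill passes needed for a waiting batch is a ceiling division.

```python
import math

def prefill_passes(num_waiting: int, max_num_seqs: int, max_prefilled_seqs: int) -> int:
    """Illustrative only: how many prefill passes a waiting batch would need,
    assuming the semantics described above (0 disables partial prefilling)."""
    batch = min(num_waiting, max_num_seqs)
    if batch == 0:
        return 0
    if max_prefilled_seqs <= 0:      # partial prefilling disabled
        return 1                     # prefill the whole batch at once
    return math.ceil(batch / max_prefilled_seqs)

# 8 waiting requests, --max-num-seqs 4, --max-prefilled-seqs 2:
# only 4 requests fit in the batch, prefilled in chunks of 2 -> 2 passes
print(prefill_passes(8, 4, 2))
```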

### 3. Sample Input and Output

@@ -306,18 +306,21 @@ async def main():
                        help='The port number on which the server will run.')
    parser.add_argument('--max-num-seqs', type=int, default=8,
                        help='Max num sequences in a batch.')
    parser.add_argument('--max-prefilled-seqs', type=int, default=0,
                        help='Max num sequences in a batch during prefilling.')

    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    low_bit = args.low_bit
    max_num_seqs = args.max_num_seqs
    max_prefilled_seqs = args.max_prefilled_seqs

    # serialize model initialization so that we do not run out of CPU memory
    for i in range(my_size):
        if my_rank == i:
            logger.info("start model initialization")
            global local_model
            local_model = ModelRunner(model_path, my_rank, my_size, low_bit, max_num_seqs, max_prefilled_seqs)
            logger.info("model initialized")
        dist.barrier()
    # Load tokenizer
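The diff above only threads `--max-prefilled-seqs` through to `ModelRunner`; the batching logic itself lives elsewhere in the example. Below is a simplified, hypothetical sketch of what a partial-prefilling admission loop could look like. The names `PartialPrefillScheduler`, `model.prefill` and `model.decode_step` are invented for illustration and are not taken from the repository:

```python
from collections import deque
from typing import Deque, List


class PartialPrefillScheduler:
    """Hypothetical sketch of the scheduling idea: admit new requests into the
    decode batch in small prefill chunks instead of all at once."""

    def __init__(self, max_num_seqs: int, max_prefilled_seqs: int):
        self.max_num_seqs = max_num_seqs
        self.max_prefilled_seqs = max_prefilled_seqs
        self.waiting: Deque[str] = deque()   # request ids waiting to be prefilled
        self.running: List[str] = []         # request ids currently decoding

    def step(self) -> List[str]:
        """One serving step: prefill at most ``max_prefilled_seqs`` new requests
        (all of them if it is 0), then the whole running batch decodes a token."""
        free_slots = self.max_num_seqs - len(self.running)
        chunk = free_slots if self.max_prefilled_seqs <= 0 \
            else min(free_slots, self.max_prefilled_seqs)
        newly_admitted = [self.waiting.popleft()
                          for _ in range(min(chunk, len(self.waiting)))]
        # model.prefill(newly_admitted)    # prompt pass for the new chunk only
        self.running.extend(newly_admitted)
        # model.decode_step(self.running)  # one token for every running request
        return newly_admitted


sched = PartialPrefillScheduler(max_num_seqs=4, max_prefilled_seqs=2)
sched.waiting.extend(f"req-{i}" for i in range(4))
print(sched.step())  # ['req-0', 'req-1'] admitted first
print(sched.step())  # ['req-2', 'req-3'] admitted on the next step
```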
6 changes: 4 additions & 2 deletions python/llm/example/GPU/Pipeline-Parallel-FastAPI/run.sh
@@ -24,11 +24,13 @@ source $basekit_root/setvars.sh --force
source $basekit_root/ccl/latest/env/vars.sh --force

export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
if [[ $KERNEL_VERSION != *"6.5"* ]]; then
  export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
fi
export TORCH_LLM_ALLREDUCE=0

export MODEL_PATH=YOUR_MODEL_PATH
export NUM_GPUS=2
export IPEX_LLM_QUANTIZE_KV_CACHE=1

CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS pipeline_serving.py --repo-id-or-model-path $MODEL_PATH --low-bit fp8 --max-num-seqs 4
CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS pipeline_serving.py --repo-id-or-model-path $MODEL_PATH --low-bit fp8 --max-num-seqs 4 --max-prefilled-seqs 0
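
Once the server is running, partial prefilling can be exercised by sending more concurrent requests than `--max-num-seqs`. The sketch below is a hypothetical client; the route, port and JSON fields are assumptions, so check the example's README for the actual request format:

```python
# Hypothetical client: the route, port and JSON fields below are assumptions
# for illustration only; see the example's README for the real request format.
import concurrent.futures

import requests

URL = "http://localhost:8000/generate/"   # assumed host, port and route


def ask(prompt: str) -> str:
    resp = requests.post(URL, json={"prompt": prompt, "n_predict": 32})
    resp.raise_for_status()
    return resp.text


prompts = [f"Question {i}: What is AI?" for i in range(8)]
# More concurrent requests than --max-num-seqs forces batching on the server.
with concurrent.futures.ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for answer in pool.map(ask, prompts):
        print(answer)
```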
