[Bug]: Batched Multi-LoRA inference failure with random length dataset #237
Comments
@tae-su-kim Please add --max-num-batched-tokens to the server command, with a value >= max_num_seqs * max_seq_len. E.g., for num_requests = 16 and max-input-len 1024, the value should be >= 16384. With this additional parameter added to the server command, I was able to execute the commands you shared without errors. For testing larger batch sizes (> 32) you will run into assert or OOM issues; please use #223, which has optimizations that enable execution with larger batch sizes.
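For reference, a minimal sketch of the same sizing rule using the offline `LLM` entry point rather than the API server (the model path is a placeholder, not the one from the original report):

```python
from vllm import LLM

# Rule of thumb from the comment above:
# max_num_batched_tokens >= max_num_seqs * max_seq_len.
max_num_seqs = 16        # e.g. 16 concurrent requests
max_input_len = 1024     # longest expected prompt
max_num_batched_tokens = max_num_seqs * max_input_len  # 16384

# Placeholder base model; any LoRA-capable model is configured the same way.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,
    max_num_seqs=max_num_seqs,
    max_num_batched_tokens=max_num_batched_tokens,
)
```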
@vivekgoe Thanks for the fast response. I will definitely check #223 for further benchmarks. Thanks for the heads-up!
@tae-su-kim Adding an assertion to check max_num_batched_tokens >= max_num_seqs * max_seq_len when enable_lora = True is a good suggestion; we will add it. Regarding the possibility of changing the scheduler to handle this better, we will discuss it and get back to you.
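A sketch of the kind of guard being proposed (function name and placement are hypothetical; the real check would live in the engine's argument validation):

```python
def check_lora_batching_limits(enable_lora: bool,
                               max_num_batched_tokens: int,
                               max_num_seqs: int,
                               max_seq_len: int) -> None:
    """Hypothetical validation mirroring the suggestion above."""
    if enable_lora:
        assert max_num_batched_tokens >= max_num_seqs * max_seq_len, (
            "With LoRA enabled, max_num_batched_tokens must cover a fully "
            f"padded prefill batch: need >= {max_num_seqs * max_seq_len}, "
            f"got {max_num_batched_tokens}."
        )
```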
I encountered the following errors while conducting experiments in the same environment as @tae-su-kim. There are two notable issues: First, errors occur when the number of requests increases. Second, even with the same number of requests, setting max-out-len=1 results in an error. Server script:
Client script:
Error message: ... Any insights or suggestions to resolve these would be greatly appreciated!
@JHLEE17 Please share the commit you used for the above. We will check and get back to you.
Cases 1 and 2 [batch size 128]: a device out-of-memory issue is observed with the enforce-eager flag and warmup enabled. We have reported the issue internally and have started looking into it with priority. If the enforce-eager flag is removed and the warmup run is enabled, the test runs without any issues, so you can continue your experiments with this configuration.
Case 3 [batch size 256]: for higher batch sizes such as 256 we have also observed the device out-of-memory issue, and debugging is in progress.
Note: to get better performance, set max-num-batched-tokens to max-num-seqs * max-model-len. The command used to run the server is shared below (vllm-fork head f858d43).
For cases 1 and 2, the profile run was underestimating the expected memory usage; setting VLLM_PROMPT_BS_BUCKET_MAX to 128 fixes the OOM issue. The command used to run the server without HPU Graphs is shared below (vllm-fork head 4c1ca3a).
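As a rough illustration of the same workaround from Python (placeholder model path; the fork is assumed to read the variable at engine start-up, and max_model_len of 1024 is assumed for the sizing note above):

```python
import os

# Cap the prompt batch-size bucket used by the HPU warmup/profile run,
# as suggested above, so memory usage is not underestimated.
os.environ["VLLM_PROMPT_BS_BUCKET_MAX"] = "128"

from vllm import LLM  # noqa: E402  (set the env var before engine creation)

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",   # placeholder model
    enable_lora=True,
    max_num_seqs=128,
    max_num_batched_tokens=128 * 1024,  # max_num_seqs * max_model_len, per the note above
)
```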
I've successfully tested with the suggested setting. You can reproduce the results with these commands (tested on vllm-fork heads 53f96b7 and 35a4a98 (latest)).
Client:
I could reproduce the device OOM issue locally with the given configuration. The same issue is observed without LoRA as well, and the related debugging is in progress.
…th LoRA (#343) This PR has the following fixes:
- Increase the size of the indices tensors used to maintain multi-LoRA state information from max_num_batched_tokens to 3*max_num_batched_tokens, to provide a buffer for the padding done in the batch and sequence dimensions.
- Move the logic that removes padding from lora_logits out of execute_model() and back into the LogitsProcessorWithLoRA class, to fix a race condition caused by updating the multi-LoRA state information directly.
FIX #237
…th LoRA (HabanaAI#339) This PR has the following fixes:
- Increase the size of the indices tensors used to maintain multi-LoRA state information from max_num_batched_tokens to 3*max_num_batched_tokens, to provide a buffer for the padding done in the batch and sequence dimensions.
- Move the logic that removes padding from lora_logits out of execute_model() and back into the LogitsProcessorWithLoRA class, to fix a race condition caused by updating the multi-LoRA state information directly.
FIX HabanaAI#237
Anything you want to discuss about vllm.
Environment:
The current implementation of batched multi-LoRA suffers from a RuntimeError in online serving scenarios (e.g., the OpenAI API server). This bug can be reproduced with the following script:
Server:
Send requests with any dataset and LoRA pattern to the API server with the number of requests >= 8. Below is an example command line for our benchmark script (https://github.com/SqueezeBits/vllm-fork/tree/benchmark).
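For illustration only (this is not the benchmark script), a minimal Python client that sends the same kind of mixed-adapter requests; the adapter names registered via --lora-modules and the prompt are placeholders:

```python
import concurrent.futures
import requests

API_URL = "http://localhost:8000/v1/completions"   # default vLLM OpenAI server endpoint
LORA_NAMES = ["lora1", "lora2"]                     # placeholder adapter names

def send_request(i: int) -> int:
    payload = {
        "model": LORA_NAMES[i % len(LORA_NAMES)],   # alternate LoRA adapters per request
        "prompt": "Summarize this: " + "lorem ipsum " * (50 + 10 * i),  # varying lengths
        "max_tokens": 128,
    }
    return requests.post(API_URL, json=payload).status_code

# >= 8 in-flight requests is enough to hit the reported failure.
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    print(list(pool.map(send_request, range(16))))
```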
Then, the following error occurs:
This error happens because the number of tokens in a prefill batch can be larger than max_num_batched_tokens. As discussed in PR #109, the current implementation of the prefill scheduler may let a prefill batch contain more tokens than max_num_batched_tokens after padding, while the index tensors for LoRA in LoRAModelManager are sized to support only up to max_num_batched_tokens:
vllm-fork/vllm/lora/models.py, lines 425 to 440 (at b4f6a29)
This causes L330 in vllm/lora/layers.py to fail on view_as(x).
Suggested solutions are either (1) to merge PR #109, or (2) to increase the size of embeddings_indices and the other index tensors to the maximum number of padded prefill tokens possible under the max_num_batched_tokens constraint (see the sketch below).
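A rough sketch of option (2), sizing the index buffers for the worst-case padded prefill batch. The tensor names mirror the LoRAModelManager snippet cited above, the factor of 3 follows the fix in #343, and the exact allocation site may differ:

```python
import torch

def allocate_lora_index_buffers(max_num_batched_tokens: int,
                                padding_factor: int = 3,
                                device: str = "cpu") -> dict:
    """Allocate LoRA index buffers with headroom for batch/sequence padding."""
    capacity = padding_factor * max_num_batched_tokens
    long_opts = dict(dtype=torch.long, device=device)
    return {
        "base_indices": torch.empty(capacity, **long_opts),
        "sampler_indices": torch.empty(capacity, **long_opts),
        "sampler_indices_padded": torch.empty(capacity, **long_opts),
        "embeddings_indices": torch.empty(2, capacity, **long_opts),
    }
```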