[Bug]: Phi-3-vision: ERROR 08-09 11:41:40 async_llm_engine.py:56] RuntimeError: stack expects each tensor to be equal size, but got [1933, 4096] at entry 0 and [2509, 4096] at entry 1 #7373

Closed
pseudotensor opened this issue Aug 9, 2024 · 14 comments · Fixed by #7392
Labels
bug Something isn't working

Comments

@pseudotensor

pseudotensor commented Aug 9, 2024

Your current environment

docker latest for 0.5.3

docker pull vllm/vllm-openai:latest
docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=1"' \
    --shm-size=10.24gb \
    -p 5063:5063 \
        -e NCCL_IGNORE_DISABLED_P2P=1 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -v "${HOME}"/.cache:$HOME/.cache/ -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/  \
    --network host \
    --name phi3vision \
    vllm/vllm-openai:latest \
        --port=5063 \
        --host=0.0.0.0 \
        --model=microsoft/Phi-3-vision-128k-instruct \
        --tensor-parallel-size=1 \
        --seed 1234 \
        --trust-remote-code \
        --max-model-len=131072 \
        --max-num-batched-tokens 131072 \
        --max-num-seqs=17 \
        --max-log-len=100 \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.phi3vision.txt

🐛 Describe the bug

I had been using Phi-3-vision for 2 weeks without issue, across many images. I'm unsure exactly what triggered it, but this is the failure.

Clearly an image processing issue.

 08-09 11:41:40 async_llm_engine.py:173] Added request chat-3379ed24490d4f398fc0db684039f72e.
WARNING 08-09 11:41:40 chat_utils.py:146] 'image_url.detail' is currently not supported and will be ignored.
INFO 08-09 11:41:40 logger.py:36] Received request chat-fcbb9c8388104f9ca63b7ecd29b89b43: prompt: '<|system|>\nYou are h2oGPTe, an expert question-answering AI system created by H2O.ai.<|end|>\n<|user|', params: SamplingParams(n=1, best_of=1, presence_penalty=0.14000000000000012, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=31248, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[32000, 32000], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [32006, 29871, 13, 3492, 526, 298, 29906, 29877, 19903, 7141, 29892, 385, 17924, 1139, 29899, 12011, 292, 319, 29902, 1788, 2825, 491, 379, 29906, 29949, 29889, 1794, 29889, 32007, 29871, 13, 32010, 29871, 13, 29966, 29989, 3027, 29918, 29896, 29989, 29958, 13, 29966, 5327, 29918, 2611, 582, 1953, 29958, 13, 29899, 3185, 408, 263, 28430, 22944, 411, 263, 15301, 10977, 363, 9493, 29889, 13, 29899, 11597, 29891, 911, 278, 2793, 2629, 278, 4558, 29889, 13, 29899, 9133, 680, 1663, 5861, 2729, 373, 596, 13917, 29889, 13, 29899, 319, 5405, 3907, 701, 17099, 29889, 13, 29899, 1938, 451, 9566, 304, 1101], lora_request: None, prompt_adapter_request: None.
INFO:     172.16.0.234:20550 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 08-09 11:41:40 async_llm_engine.py:173] Added request chat-fcbb9c8388104f9ca63b7ecd29b89b43.
ERROR 08-09 11:41:40 async_llm_engine.py:56] Engine background task failed
ERROR 08-09 11:41:40 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 08-09 11:41:40 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
ERROR 08-09 11:41:40 async_llm_engine.py:56]     return_value = task.result()
ERROR 08-09 11:41:40 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 637, in run_engine_loop
ERROR 08-09 11:41:40 async_llm_engine.py:56]     result = task.result()
ERROR 08-09 11:41:40 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 580, in engine_step
ERROR 08-09 11:41:40 async_llm_engine.py:56]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 08-09 11:41:40 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 253, in step_async
ERROR 08-09 11:41:40 async_llm_engine.py:56]     output = await self.model_executor.execute_model_async(
ERROR 08-09 11:41:40 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 159, in execute_model_async
ERROR 08-09 11:41:40 async_llm_engine.py:56]     output = await make_async(self.driver_worker.execute_model
ERROR 08-09 11:41:40 async_llm_engine.py:56]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 08-09 11:41:40 async_llm_engine.py:56]     result = self.fn(*self.args, **self.kwargs)
ERROR 08-09 11:41:40 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 272, in execute_model
ERROR 08-09 11:41:40 async_llm_engine.py:56]     output = self.model_runner.execute_model(
ERROR 08-09 11:41:40 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 08-09 11:41:40 async_llm_engine.py:56]     return func(*args, **kwargs)
ERROR 08-09 11:41:40 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1314, in execute_model
ERROR 08-09 11:41:40 async_llm_engine.py:56]     hidden_or_intermediate_states = model_executable(
ERROR 08-09 11:41:40 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-09 11:41:40 async_llm_engine.py:56]     return self._call_impl(*args, **kwargs)
ERROR 08-09 11:41:40 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-09 11:41:40 async_llm_engine.py:56]     return forward_call(*args, **kwargs)
ERROR 08-09 11:41:40 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/phi3v.py", line 529, in forward
ERROR 08-09 11:41:40 async_llm_engine.py:56]     vision_embeddings = self.vision_embed_tokens(
ERROR 08-09 11:41:40 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-09 11:41:40 async_llm_engine.py:56]     return self._call_impl(*args, **kwargs)
ERROR 08-09 11:41:40 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-09 11:41:40 async_llm_engine.py:56]     return forward_call(*args, **kwargs)
ERROR 08-09 11:41:40 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/phi3v.py", line 166, in forward
ERROR 08-09 11:41:40 async_llm_engine.py:56]     image_features_proj = self.hd_feature_transform(
ERROR 08-09 11:41:40 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/phi3v.py", line 219, in hd_feature_transform
ERROR 08-09 11:41:40 async_llm_engine.py:56]     torch.stack(all_image_embeddings).to(target_device, target_dtype)
ERROR 08-09 11:41:40 async_llm_engine.py:56] RuntimeError: stack expects each tensor to be equal size, but got [1933, 4096] at entry 0 and [2509, 4096] at entry 1
Exception in callback functools.partial(<function _log_task_completion at 0x707b33318430>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x707b16ac2ce0>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x707b33318430>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x707b16ac2ce0>>)>
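
The final RuntimeError comes from torch.stack, which requires every tensor in the list to have the same shape. The sketch below mimics that precondition in plain Python so it runs without PyTorch or a GPU. `first_shape_mismatch` is a hypothetical helper written for illustration, not part of vLLM or PyTorch, and the shapes are the ones reported in the traceback.

```python
from typing import Optional, Sequence, Tuple

Shape = Tuple[int, ...]


def first_shape_mismatch(shapes: Sequence[Shape]) -> Optional[str]:
    """Return a torch.stack-style message for the first shape that
    differs from entry 0, or None if all shapes are equal."""
    if not shapes:
        return None
    reference = shapes[0]
    for index, shape in enumerate(shapes[1:], start=1):
        if shape != reference:
            return (
                "stack expects each tensor to be equal size, "
                f"but got {list(reference)} at entry 0 "
                f"and {list(shape)} at entry {index}"
            )
    return None


# Two per-image embedding shapes, as reported in the traceback.
print(first_shape_mismatch([(1933, 4096), (2509, 4096)]))
# A single-entry list always passes the check.
print(first_shape_mismatch([(1933, 4096)]))
```

The second call returns None: a list with one tensor can always be stacked, so the failure implies the list held at least two embeddings of different lengths.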

@pseudotensor pseudotensor added the bug Something isn't working label Aug 9, 2024
@DarkLight1337
Member

DarkLight1337 commented Aug 10, 2024

Could you share the images which caused this issue?

Also cc @Isotr0py

@pseudotensor
Author

I wish I could. I think it was just the same images we normally use for testing/benchmarking that triggered it.

@Isotr0py
Collaborator

OK, I will check it later.

@jaywonchung
Contributor

jaywonchung commented Aug 10, 2024

I'm suddenly hitting this myself as well. Running v0.5.4, and the same thing used to work well in v0.5.2 previously.

My error just has different numbers in tensor sizes:

RuntimeError: stack expects each tensor to be equal size, but got [1921, 4096] at entry 0 and [1933, 4096] at entry 1

@Isotr0py
Collaborator

I served the latest Phi-3-vision model and ran openai_vision_api_client.py on the v0.5.4 release without any errors:

INFO 08-10 03:29:53 logger.py:36] Received request chat-c1e598065d3448c298ff4e1198f3c00a: prompt: '<|user|>\n<|image_1|>\nWhat’s in this image?<|end|>\n<|assistant|>\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=64, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [32010, 29871, 13, 29966, 29989, 3027, 29918, 29896, 29989, 29958, 13, 5618, 30010, 29879, 297, 445, 1967, 29973, 32007, 29871, 13, 32001], lora_request: None, prompt_adapter_request: None.
INFO 08-10 03:29:53 async_llm_engine.py:174] Added request chat-c1e598065d3448c298ff4e1198f3c00a.
INFO 08-10 03:29:53 metrics.py:406] Avg prompt throughput: 142.0 tokens/s, Avg generation throughput: 5.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 14.8%, CPU KV cache usage: 0.0%.
INFO 08-10 03:29:55 async_llm_engine.py:141] Finished request chat-c1e598065d3448c298ff4e1198f3c00a.
INFO:     127.0.0.1:51464 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 08-10 03:29:55 logger.py:36] Received request chat-ed3c19c5762d4e7a83645a4d284dd7b7: prompt: '<|user|>\n<|image_1|>\nWhat’s in this image?<|end|>\n<|assistant|>\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=64, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [32010, 29871, 13, 29966, 29989, 3027, 29918, 29896, 29989, 29958, 13, 5618, 30010, 29879, 297, 445, 1967, 29973, 32007, 29871, 13, 32001], lora_request: None, prompt_adapter_request: None.
INFO 08-10 03:29:55 async_llm_engine.py:174] Added request chat-ed3c19c5762d4e7a83645a4d284dd7b7.
INFO 08-10 03:29:58 metrics.py:406] Avg prompt throughput: 773.4 tokens/s, Avg generation throughput: 13.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 15.1%, CPU KV cache usage: 0.0%.
INFO 08-10 03:29:58 async_llm_engine.py:141] Finished request chat-ed3c19c5762d4e7a83645a4d284dd7b7.
INFO:     127.0.0.1:51464 - "POST /v1/chat/completions HTTP/1.1" 200 OK

@pseudotensor @jaywonchung Could you provide the problematic images for reproduction?

@Isotr0py
Collaborator

Isotr0py commented Aug 10, 2024

> I'm suddenly hitting this myself as well. Running v0.5.4, and the same thing used to work well in v0.5.2 previously.
>
> My error just has different numbers in tensor sizes:
>
> RuntimeError: stack expects each tensor to be equal size, but got [1921, 4096] at entry 0 and [1933, 4096] at entry 1

In most cases, the torch.stack call in the phi3v image embedding should only ever receive a list containing a single tensor, because we haven't added multi-image support for Phi-3-vision yet.

It seems the error is caused by image sizes with image_num>2 being created for some reason.
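
The tensor sizes in both error messages can be checked against a per-image token formula. The formula below is an assumption here: it is, as far as I recall, the one used in the model's Hugging Face image-processing code for an image tiled into 336-pixel crops, and the cap of 16 crops is likewise assumed. Nothing in this thread confirms either. `num_image_tokens` and `grids_for` are hypothetical helper names.

```python
from typing import List, Tuple


def num_image_tokens(h_crops: int, w_crops: int) -> int:
    """Token count for an image tiled into h_crops x w_crops crops
    (assumed formula, see the note above)."""
    return (h_crops * w_crops + 1) * 144 + 1 + (h_crops + 1) * 12


def grids_for(token_count: int, max_crops: int = 16) -> List[Tuple[int, int]]:
    """All (h_crops, w_crops) grids with at most max_crops crops
    that produce exactly token_count tokens."""
    return [
        (h, w)
        for h in range(1, max_crops + 1)
        for w in range(1, max_crops // h + 1)
        if num_image_tokens(h, w) == token_count
    ]


# First dimensions of the tensors reported in the two error messages.
for count in (1921, 1933, 2509):
    print(count, grids_for(count))
# 1921 [(3, 4)]
# 1933 [(4, 3)]
# 2509 [(4, 4)]
```

Under these assumptions each reported size maps to exactly one crop grid, and the grids differ within each error message. That would be consistent with single-image requests whose images have different sizes or aspect ratios ending up in the same torch.stack call.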

@pseudotensor
Author

I am only ever using one image at a time. There is an assert in vllm preventing multiple images.

@Isotr0py
Collaborator

Isotr0py commented Aug 10, 2024

Do all images trigger this issue, or just some of them? Could you also try openai_vision_api_client.py to see whether it works?

I know we are preventing multiple images; something else must be causing a problematic image_size.

Since I can't reproduce this error, I need more details to figure out what's happening.

@jaywonchung
Contributor

Thanks for looking into this.

data.json: ten pairs of a text prompt and a base64-encoded JPEG image. The file is a length-ten list of:

@dataclass
class Request:
    image: str
    prompt: str

I get the error when I send all ten requests to the /v1/chat/completions API with this payload:

    pload = {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {
                    "type": "text",
                    "text": prompt,
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image}",
                    },
                },
            ]},
        ],
        "stream": False,
        "max_tokens": 1024,
        "temperature": 0.8,
        "top_p": 0.95,
        "stop": ["\nUser:", "<|endoftext|>", "</s>"],
    }
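
A minimal sketch of sending all requests at once, assuming "all ten" means they are in flight concurrently so the server can batch several of them in one step. `build_payload`, `send_all_concurrently`, and `fake_send` are hypothetical names. The stub stands in for the real call so the sketch runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

SYSTEM_PROMPT = (
    "You are an artificial intelligence assistant that gives helpful "
    "answers to the user's questions or instructions."
)


def build_payload(model: str, prompt: str, image_b64: str) -> Dict[str, Any]:
    """Build one chat-completions payload with a text part and an image part."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ]},
        ],
        "stream": False,
        "max_tokens": 1024,
        "temperature": 0.8,
        "top_p": 0.95,
        "stop": ["\nUser:", "<|endoftext|>", "</s>"],
    }


def send_all_concurrently(
    payloads: List[Dict[str, Any]],
    send: Callable[[Dict[str, Any]], Any],
    max_workers: int = 10,
) -> List[Any]:
    """Submit every payload from its own worker thread; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, payloads))


def fake_send(payload: Dict[str, Any]) -> str:
    """Stand-in for a real request; echoes the prompt text back."""
    return payload["messages"][1]["content"][0]["text"]


payloads = [build_payload("demo-model", f"prompt {i}", "AAAA") for i in range(3)]
print(send_all_concurrently(payloads, fake_send))
```

Against a real server, `send` could be `lambda p: client.chat.completions.create(**p)` with an OpenAI client pointed at the vLLM endpoint. A client that sends requests one at a time may never put two images in the same batch, which could matter when trying to reproduce this error.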

vLLM INFO log for one of the requests:

INFO 08-09 21:18:19 logger.py:36] Received request chat-965c22b661674f47b539051015e8b9f9: prompt: "<|system|>\nYou are an artificial intelligence assistant that gives helpful answers to the user's questions or instructions.<|end|>\n<|user|>\n<|image_1|>\nWhat is the primary activity taking place on the beach in the image? What is the condition of the kite's tail in the image? What does the boy in the image appear to be doing with the kite? What are some benefits of kite flying as an outdoor activity?<|end|>\n<|assistant|>\n", params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.8, top_p=0.95, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['\nUser:', '<|endoftext|>', '</s>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [32006, 29871, 13, 3492, 526, 385, 23116, 21082, 20255, 393, 4076, 8444, 6089, 304, 278, 1404, 29915, 29879, 5155, 470, 11994, 29889, 32007, 29871, 13, 32010, 29871, 13, 29966, 29989, 3027, 29918, 29896, 29989, 29958, 13, 5618, 338, 278, 7601, 6354, 5622, 2058, 373, 278, 25695, 297, 278, 1967, 29973, 1724, 338, 278, 4195, 310, 278, 413, 568, 29915, 29879, 12464, 297, 278, 1967, 29973, 1724, 947, 278, 8023, 297, 278, 1967, 2615, 304, 367, 2599, 411, 278, 413, 568, 29973, 1724, 526, 777, 23633, 310, 413, 568, 22764, 408, 385, 714, 17433, 6354, 29973, 32007, 29871, 13, 32001], lora_request: None, prompt_adapter_request: None.

@Isotr0py
Collaborator

@jaywonchung It's strange that all images work well on my side with a newly created environment. Maybe you could create a new conda environment and install vLLM fresh?

Server Log

$ vllm serve microsoft/Phi-3-vision-128k-instruct --dtype half --api-key EMPTY --trust-remote-code --max-model-len 4096
INFO 08-10 04:52:28 api_server.py:339] vLLM API server version 0.5.4
INFO 08-10 04:52:28 api_server.py:340] args: Namespace(model_tag='microsoft/Phi-3-vision-128k-instruct', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='EMPTY', lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='microsoft/Phi-3-vision-128k-instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, 
num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7e319a992050>)
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 3.66k/3.66k [00:00<00:00, 20.7MB/s]
configuration_phi3_v.py: 100%|███████████████████████████████████████████████████████████████████████████████████| 10.6k/10.6k [00:00<00:00, 41.6MB/s]
A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-vision-128k-instruct:
- configuration_phi3_v.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
WARNING 08-10 04:52:29 config.py:1454] Casting torch.bfloat16 to torch.float16.
WARNING 08-10 04:52:29 config.py:1454] Casting torch.bfloat16 to torch.float16.
INFO 08-10 04:52:29 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='microsoft/Phi-3-vision-128k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-vision-128k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=microsoft/Phi-3-vision-128k-instruct, use_v2_block_manager=False, enable_prefix_caching=False)
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████| 9.40k/9.40k [00:00<00:00, 33.4MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 1.85M/1.85M [00:00<00:00, 25.9MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████| 670/670 [00:00<00:00, 4.58MB/s]
INFO 08-10 04:52:30 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-10 04:52:30 selector.py:54] Using XFormers backend.
/opt/conda/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/opt/conda/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 08-10 04:52:31 model_runner.py:720] Starting to load model microsoft/Phi-3-vision-128k-instruct...
INFO 08-10 04:52:31 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-10 04:52:31 selector.py:54] Using XFormers backend.
INFO 08-10 04:52:32 weight_utils.py:225] Using model weights format ['*.safetensors']
model-00002-of-00002.safetensors: 100%|███████████████████████████████████████████████████████████████████████████| 3.35G/3.35G [00:18<00:00, 177MB/s]
model-00001-of-00002.safetensors: 100%|███████████████████████████████████████████████████████████████████████████| 4.94G/4.94G [00:27<00:00, 177MB/s]
model.safetensors.index.json: 100%|██████████████████████████████████████████████████████████████████████████████| 68.9k/68.9k [00:00<00:00, 3.12MB/s]
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:04<00:04,  4.56s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:07<00:00,  3.44s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:07<00:00,  3.61s/it]

INFO 08-10 04:53:07 model_runner.py:732] Loading model weights took 7.7498 GB
preprocessor_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████| 464/464 [00:00<00:00, 2.55MB/s]
image_processing_phi3_v.py: 100%|████████████████████████████████████████████████████████████████████████████████| 11.4k/11.4k [00:00<00:00, 38.2MB/s]
A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-vision-128k-instruct:
- image_processing_phi3_v.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
/opt/conda/envs/vllm/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
INFO 08-10 04:53:11 gpu_executor.py:102] # GPU blocks: 819, # CPU blocks: 682
INFO 08-10 04:53:16 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-10 04:53:16 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-10 04:53:38 model_runner.py:1225] Graph capturing finished in 22 secs.
WARNING 08-10 04:53:39 serving_embedding.py:171] embedding_mode is False. Embedding API will not work.
INFO 08-10 04:53:39 launcher.py:14] Available routes are:
INFO 08-10 04:53:39 launcher.py:22] Route: /openapi.json, Methods: GET, HEAD
INFO 08-10 04:53:39 launcher.py:22] Route: /docs, Methods: GET, HEAD
INFO 08-10 04:53:39 launcher.py:22] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 08-10 04:53:39 launcher.py:22] Route: /redoc, Methods: GET, HEAD
INFO 08-10 04:53:39 launcher.py:22] Route: /health, Methods: GET
INFO 08-10 04:53:39 launcher.py:22] Route: /tokenize, Methods: POST
INFO 08-10 04:53:39 launcher.py:22] Route: /detokenize, Methods: POST
INFO 08-10 04:53:39 launcher.py:22] Route: /v1/models, Methods: GET
INFO 08-10 04:53:39 launcher.py:22] Route: /version, Methods: GET
INFO 08-10 04:53:39 launcher.py:22] Route: /v1/chat/completions, Methods: POST
INFO 08-10 04:53:39 launcher.py:22] Route: /v1/completions, Methods: POST
INFO 08-10 04:53:39 launcher.py:22] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [3177]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     127.0.0.1:49392 - "GET /v1/models HTTP/1.1" 200 OK
INFO 08-10 04:53:44 logger.py:36] Received request chat-e72fd5687baf4e019e4b28c086fc8a7e: prompt: "<|system|>\nYou are an artificial intelligence assistant that gives helpful answers to the user's questions or instructions.<|end|>\n<|user|>\n<|image_1|>\nWhat kind of fruit can be seen in the image? Where is the fruit located? Can you describe the photograph's style or effect? Is the image clear and straightforward or does it have a unique visual style?<|end|>\n<|assistant|>\n", params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.8, top_p=0.95, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['\nUser:', '<|endoftext|>', '</s>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [32006, 29871, 13, 3492, 526, 385, 23116, 21082, 20255, 393, 4076, 8444, 6089, 304, 278, 1404, 29915, 29879, 5155, 470, 11994, 29889, 32007, 29871, 13, 32010, 29871, 13, 29966, 29989, 3027, 29918, 29896, 29989, 29958, 13, 5618, 2924, 310, 15774, 508, 367, 3595, 297, 278, 1967, 29973, 6804, 338, 278, 15774, 5982, 29973, 1815, 366, 8453, 278, 17739, 29915, 29879, 3114, 470, 2779, 29973, 1317, 278, 1967, 2821, 322, 20837, 470, 947, 372, 505, 263, 5412, 7604, 3114, 29973, 32007, 29871, 13, 32001], lora_request: None, prompt_adapter_request: None.
INFO 08-10 04:53:44 async_llm_engine.py:174] Added request chat-e72fd5687baf4e019e4b28c086fc8a7e.
INFO 08-10 04:53:46 metrics.py:406] Avg prompt throughput: 329.7 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 19.8%, CPU KV cache usage: 0.0%.
INFO 08-10 04:53:50 async_llm_engine.py:141] Finished request chat-e72fd5687baf4e019e4b28c086fc8a7e.
INFO:     127.0.0.1:49392 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Client code and outputs

import json
import base64

import requests
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Use base64 encoded image in the payload
def encode_image_base64_from_url(image_url: str) -> str:
    """Encode an image retrieved from a remote url to base64 format."""

    with requests.get(image_url) as response:
        response.raise_for_status()
        result = base64.b64encode(response.content).decode('utf-8')

    return result

with open("data.json", "r") as f:
    data = json.load(f)

SYSTEM_PROMPT = "You are an artificial intelligence assistant that gives helpful answers to the user's questions or instructions."
for idx in range(len(data)):
    prompt = data[idx]["prompt"]
    image = data[idx]["image"]

    pload = {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {
                    "type": "text",
                    "text": prompt,
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image}",
                    },
                },
            ]},
        ],
        "stream": False,
        "max_tokens": 1024,
        "temperature": 0.8,
        "top_p": 0.95,
        "stop": ["\nUser:", "<|endoftext|>", "</s>"],
    }
    chat_completion_from_base64 = client.chat.completions.create(**pload)

    result = chat_completion_from_base64.choices[0].message.content
    print(f"Chat {idx} completion output:{result}\n")

Chat 0 completion output: The image displays a variety of fruits, including oranges, apples, and a bunch of bananas. These fruits are placed on a counter or a kitchen countertop, giving a homely and rustic feel to the scene. The photograph features a unique visual style with its colorful and messy appearance, giving it a lively and dynamic quality.

Chat 1 completion output: In the image, there are two bear cubs. The color of the bear cubs is brown, resembling a mix of grey and brown. The setting of the image is a wooded area, likely a forest or a wildlife habitat. The bear cubs are exploring and investigating their surroundings, walking through the forest together. It is indeed usual for young bears to explore their surroundings, as it helps them learn about their environment, find food, and develop essential survival skills.

Chat 2 completion output: There are two dogs in the image. One of the dogs is actively herding the sheep, while the other dog is resting. The man, who is a farmer, is bending over and walking in front of the sheep. The farmer is wearing a hat and a jacket, indicating that they are dressed appropriately for the activity. The dogs, likely Border Collies, are assisting the farmer in herding the sheep, ensuring they are organized and moving in the desired direction.

Chat 3 completion output: The image is a black and white photograph of a street corner. It was taken in The Netherlands, as indicated by the photographer's credit. The street in the image is quiet, with few visible people, and there are no cars or other vehicles present. The street signs are bilingual, with both English and German on them, which suggests that the location might be in a region where both languages are spoken. The signs are attached to a pole on the corner of the street. The overall atmosphere of the street appears to be calm and peaceful, with no visible signs of activity or hustle and bustle. The function of the street signs is to provide direction and information to pedestrians and motorists, helping them navigate the area more effectively.

Chat 4 completion output: The man in the image is wearing a white shirt and a black tie.

Chat 5 completion output: The primary activity taking place on the beach in the image is kite flying. The condition of the kite's tail appears to be intact, with no visible tears or damage. The boy in the image appears to be holding and controlling the kite, with his shadow visible on the ground. Some benefits of kite flying as an outdoor activity include being a fun and engaging form of exercise, improving hand-eye coordination, promoting relaxation and stress relief, fostering creativity and imagination, and strengthening connections with nature. Additionally, kite flying is a social activity that can be enjoyed with friends, family, or even strangers, encouraging interaction and bonding between participants.

Chat 6 completion output: The image depicts two baseball players, with one player jumping up to catch the ball. The photo is in black and white, giving it a historical and timeless feel. The black and white mood implies a sense of nostalgia and emphasizes the raw emotion of the moment captured in the image. 

Black and white photography in relation to sports like baseball dates back to the early 20th century, when color photography was not yet widespread or commonly used. This type of photography was widely popular during the 1920s and 1930s and was considered the standard for sports photography. It helped showcase the intensity, action, and excitement of sports games, capturing a unique atmosphere that is still appreciated today.

Through black and white photography, sports images often convey a sense of timelessness, highlighting the passion and dedication of athletes, as well as the intensity and excitement of the sport itself. The monochrome aesthetic emphasizes the contrast and the essential elements of the scene, drawing focus to the subjects and the athleticism displayed by the players.

Chat 7 completion output: There are two small boats visible in the image. The boats are situated on the shore of the body of water, with the front boat closer to the water's edge and the rear boat farther from the water's edge. Behind the boats, you can see a picturesque landscape consisting of a large, blue, mountainous island, a lush green forested shore, and a cloudy sky.

With these small boats, one can engage in various recreational activities such as fishing, exploring the surrounding water, boating on calm waters, or simply enjoying the view. The serene environment and the natural beauty of the mountainous island make this a perfect spot for relaxation and leisure activities.

Chat 8 completion output: The main action in the image is a baseball player catching a fly ball. The player is wearing a catcher's mitt, which is designed for this specific purpose. The player is a member of the Dodgers baseball team, as indicated by the white uniform. To successfully catch a fly ball, a baseball player needs a combination of skills, including hand-eye coordination, agility, speed, and quick reflexes. In addition, having a strong understanding of the game, anticipating the ball's trajectory, and communicating with teammates are essential aspects of a player's role in catching a fly ball.

Chat 9 completion output: The traffic light in the image is red, and there are signs located near the traffic light. The photo of the traffic light is taken at nighttime. The image depicts a busy street, as evidenced by the presence of multiple vehicles in the background. The signs below the stoplight might be confusing due to their low visibility or unclear symbols, which can cause confusion for drivers and pedestrians.

@jaywonchung
Contributor

Actually, if I send requests one at a time (or at most two at a time), everything works. The problem only occurs when I send all ten requests to the server simultaneously.

@Isotr0py
Collaborator

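The problem seems to be caused by the image_sizes batching: when images of different sizes land in the same batch, their embeddings have different lengths and torch.stack cannot combine them: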

image_sizes: tensor([[1344, 1008],
        [1008, 1344]], device='cuda:0') torch.Size([2, 2])
ERROR 08-10 06:36:15 async_llm_engine.py:61] Engine background task failed
ERROR 08-10 06:36:15 async_llm_engine.py:61] Traceback (most recent call last):
ERROR 08-10 06:36:15 async_llm_engine.py:61]   File "/kaggle/working/vllm/vllm/engine/async_llm_engine.py", line 51, in _log_task_completion
ERROR 08-10 06:36:15 async_llm_engine.py:61]     return_value = task.result()
ERROR 08-10 06:36:15 async_llm_engine.py:61]   File "/kaggle/working/vllm/vllm/engine/async_llm_engine.py", line 772, in run_engine_loop
ERROR 08-10 06:36:15 async_llm_engine.py:61]     result = task.result()
ERROR 08-10 06:36:15 async_llm_engine.py:61]   File "/kaggle/working/vllm/vllm/engine/async_llm_engine.py", line 715, in engine_step
ERROR 08-10 06:36:15 async_llm_engine.py:61]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 08-10 06:36:15 async_llm_engine.py:61]   File "/kaggle/working/vllm/vllm/engine/async_llm_engine.py", line 282, in step_async
ERROR 08-10 06:36:15 async_llm_engine.py:61]     output = await self.model_executor.execute_model_async(
ERROR 08-10 06:36:15 async_llm_engine.py:61]   File "/kaggle/working/vllm/vllm/executor/gpu_executor.py", line 160, in execute_model_async
ERROR 08-10 06:36:15 async_llm_engine.py:61]     output = await make_async(self.driver_worker.execute_model
ERROR 08-10 06:36:15 async_llm_engine.py:61]   File "/opt/conda/envs/vllm/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 08-10 06:36:15 async_llm_engine.py:61]     result = self.fn(*self.args, **self.kwargs)
ERROR 08-10 06:36:15 async_llm_engine.py:61]   File "/kaggle/working/vllm/vllm/worker/worker_base.py", line 282, in execute_model
ERROR 08-10 06:36:15 async_llm_engine.py:61]     output = self.model_runner.execute_model(
ERROR 08-10 06:36:15 async_llm_engine.py:61]   File "/opt/conda/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 08-10 06:36:15 async_llm_engine.py:61]     return func(*args, **kwargs)
ERROR 08-10 06:36:15 async_llm_engine.py:61]   File "/kaggle/working/vllm/vllm/worker/model_runner.py", line 1538, in execute_model
ERROR 08-10 06:36:15 async_llm_engine.py:61]     hidden_or_intermediate_states = model_executable(
ERROR 08-10 06:36:15 async_llm_engine.py:61]   File "/opt/conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 08-10 06:36:15 async_llm_engine.py:61]     return self._call_impl(*args, **kwargs)
ERROR 08-10 06:36:15 async_llm_engine.py:61]   File "/opt/conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 08-10 06:36:15 async_llm_engine.py:61]     return forward_call(*args, **kwargs)
ERROR 08-10 06:36:15 async_llm_engine.py:61]   File "/kaggle/working/vllm/vllm/model_executor/models/phi3v.py", line 530, in forward
ERROR 08-10 06:36:15 async_llm_engine.py:61]     vision_embeddings = self.vision_embed_tokens(
ERROR 08-10 06:36:15 async_llm_engine.py:61]   File "/opt/conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 08-10 06:36:15 async_llm_engine.py:61]     return self._call_impl(*args, **kwargs)
ERROR 08-10 06:36:15 async_llm_engine.py:61]   File "/opt/conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 08-10 06:36:15 async_llm_engine.py:61]     return forward_call(*args, **kwargs)
ERROR 08-10 06:36:15 async_llm_engine.py:61]   File "/kaggle/working/vllm/vllm/model_executor/models/phi3v.py", line 166, in forward
ERROR 08-10 06:36:15 async_llm_engine.py:61]     image_features_proj = self.hd_feature_transform(
ERROR 08-10 06:36:15 async_llm_engine.py:61]   File "/kaggle/working/vllm/vllm/model_executor/models/phi3v.py", line 220, in hd_feature_transform
ERROR 08-10 06:36:15 async_llm_engine.py:61]     torch.stack(all_image_embeddings).to(target_device, target_dtype)
ERROR 08-10 06:36:15 async_llm_engine.py:61] RuntimeError: stack expects each tensor to be equal size, but got [1933, 4096] at entry 0 and [1921, 4096] at entry 1
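
The two lengths in the error can be reproduced from the logged image_sizes. The following is a minimal sketch, not the vLLM code itself: the function name `num_image_tokens` is hypothetical, and the formula (336-pixel crops, 144 features per crop, one extra global crop, one separator, 12 newline features per crop row) is an assumption about Phi-3-vision's HD transform that happens to match the numbers in the log, with each image_sizes row read as (height, width).

```python
# Sketch: embedding length per image under the assumed Phi-3-vision HD layout.
CROP = 336  # assumed crop edge length in pixels


def num_image_tokens(height: int, width: int) -> int:
    """Return the assumed number of embedding rows for one image."""
    rows, cols = height // CROP, width // CROP
    crop_features = (rows * cols + 1) * 144  # all crops plus one global crop
    separator = 1                            # one separator between global and crops
    newlines = (rows + 1) * 12               # one newline feature per feature row
    return crop_features + separator + newlines


# Sizes taken from the log above, read as (height, width).
image_sizes = [(1344, 1008), (1008, 1344)]
lengths = [num_image_tokens(h, w) for h, w in image_sizes]
print(lengths)  # [1933, 1921]
```

Under this assumption the same pixel count in portrait and landscape orientation yields different lengths (1933 vs. 1921), so a batch containing both cannot be stacked into one tensor. A 1344x1344 image would give 2509, which matches the second length in the issue title.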

@Isotr0py
Collaborator

@pseudotensor @jaywonchung I have created #7392 to fix this.

@jaywonchung
Contributor

Confirming that the PR fixes this issue. Thanks a lot @Isotr0py!
