[Bug]: AssertionError: libcuda.so cannot found with vllm/vllm-openai:v0.4.0 #3808
Comments
Check / set your CUDA_PATH environment variable.
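As a quick sanity check (a minimal diagnostic sketch, not from this thread; it only assumes a Linux container with Python available):

```python
import os
import ctypes.util

# CUDA_PATH should point at the CUDA toolkit root, e.g. /usr/local/cuda.
print("CUDA_PATH =", os.environ.get("CUDA_PATH"))

# libcuda.so is the driver library. find_library() consults the same
# loader machinery (ldconfig cache / LD_LIBRARY_PATH) the runtime uses,
# so a None here corresponds to the "libcuda.so cannot found" assertion.
print("libcuda resolves to:", ctypes.util.find_library("cuda"))
```

If `find_library` returns `None`, the problem is the driver library's visibility inside the container rather than the toolkit path.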
I have the same issue; how did you fix it?
You may want to run your job in Docker with CUDA 11.8 or 12.1 explicitly specified.
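One way to confirm which toolkit the container's PyTorch build expects versus what the host driver can actually serve (a sketch assuming PyTorch is installed, which it is in the vLLM images):

```python
import torch

# The CUDA version this wheel was compiled against (e.g. "11.8" or "12.1");
# the host driver must be new enough to run it.
print("torch built with CUDA:", torch.version.cuda)

# False here usually means the driver library could not be loaded at all.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```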
Same issue for me. I ran the Mixtral-8x7B model using vllm/vllm-openai:v0.4.0 on k8s with an H100.
I've tested multiple versions. This issue did not occur up to and including version 0.2.7, but it started happening from version 0.3.0 onwards.
This is not a problem of vllm itself. In general, libcuda.so is provided by the host's NVIDIA driver, not by the container image.
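If the driver library is indeed what's missing, the dynamic loader cache is the first place to look; the failing assertion comes from a lookup along these lines (a sketch assuming `ldconfig` is on the container's PATH):

```python
import subprocess

# List everything the loader cache knows about and filter for libcuda.
# On a healthy node the NVIDIA container runtime mounts libcuda.so.1
# from the host driver, and it shows up here.
out = subprocess.check_output(["ldconfig", "-p"]).decode()
hits = [line.strip() for line in out.splitlines() if "libcuda.so" in line]
print("\n".join(hits) if hits else "libcuda.so is not in the ldconfig cache")
```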
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Your current environment
Running in Kubernetes on an H100, using the vllm/vllm-openai:v0.4.0 image.
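For a pod on this kind of setup, two things are worth printing before anything else (a sketch; the environment variable is how the NVIDIA container toolkit decides which driver libraries to mount, and the path shown is only the typical location on Ubuntu-based images):

```python
import glob
import os

# "compute" must be among the requested capabilities for libcuda.so
# to be mounted into the container at all.
print("NVIDIA_DRIVER_CAPABILITIES =", os.environ.get("NVIDIA_DRIVER_CAPABILITIES"))

# Typical mount location on Debian/Ubuntu-based images.
print(glob.glob("/usr/lib/x86_64-linux-gnu/libcuda.so*"))
```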
🐛 Describe the bug
It seems like there have been some weird dependency issues since v0.2.7. We would love to use FlashAttention once you are able to fix this. Thank you! 🙇