Tmp Directory Locked #2675
You can add this to resolve the lock issue.
Same problem here; I think this needs attention. It seems to appear randomly!

```python
from vllm import LLM, SamplingParams

prompts = [
    "<s><|im_start|>user\nHello, my name is<|im_end|>\n<|im_start|>assistant\n",
    "<s><|im_start|>user\nThe president of the United States is<|im_end|>\n<|im_start|>assistant\n",
    "<s><|im_start|>user\nThe capital of France is<|im_end|>\n<|im_start|>assistant\n",
    "<s><|im_start|>user\nThe future of AI is<|im_end|>\n<|im_start|>assistant\n",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="WizardLM/WizardLM-70B-V1.0", trust_remote_code=True,
          tensor_parallel_size=4)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

The output is sometimes like this:

```
INFO 01-31 14:22:40 llm_engine.py:70] Initializing an LLM engine with config: model='WizardLM/WizardLM-70B-V1.0', tokenizer='WizardLM/WizardLM-70B-V1.0', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=4, quantization=None, enforce_eager=False, seed=0)
Traceback (most recent call last):
File "/mnt/hwfile/limo/opencompass_fork/configs/needleinahaystack/wizard_debug.py", line 10, in <module>
llm = LLM(model="WizardLM/WizardLM-70B-V1.0", trust_remote_code=True,
File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 105, in __init__
self.llm_engine = LLMEngine.from_engine_args(engine_args)
File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 309, in from_engine_args
engine = cls(*engine_configs,
File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 109, in __init__
self._init_workers_ray(placement_group)
File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 249, in _init_workers_ray
self._run_workers(
File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 795, in _run_workers
driver_worker_output = getattr(self.driver_worker,
File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/worker/worker.py", line 81, in load_model
self.model_runner.load_model()
File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 64, in load_model
self.model = get_model(self.model_config)
File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/model_executor/model_loader.py", line 72, in get_model
model.load_weights(model_config.model, model_config.download_dir,
File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 313, in load_weights
for name, loaded_weight in hf_model_weights_iterator(
File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/model_executor/weight_utils.py", line 198, in hf_model_weights_iterator
hf_folder, hf_weights_files, use_safetensors = prepare_hf_model_weights(
File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/model_executor/weight_utils.py", line 154, in prepare_hf_model_weights
with get_lock(model_name_or_path, cache_dir):
File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/filelock/_api.py", line 297, in __enter__
self.acquire()
File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/filelock/_api.py", line 255, in acquire
self._acquire()
File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/filelock/_unix.py", line 39, in _acquire
fd = os.open(self.lock_file, open_flags, self._context.mode)
PermissionError: [Errno 13] Permission denied: '/tmp/WizardLM-WizardLM-70B-V1.0.lock'
```

and sometimes like this:

```
(opencompass_fork) [limo@HOST-10-140-60-209 opencompass_fork]$ srun -p llm_dev2 --quotatype=auto --gres=gpu:4 -N1 -u python3 configs/needleinahaystack/wizard_debug.py
srun: job 3348453 queued and waiting for resources
srun: job 3348453 has been allocated resources
srun: Job 3348453 scheduled successfully!
Current QUOTA_TYPE is [reserved], which means the job has occupied quota in RESERVED_TOTAL under your partition.
Current PHX_PRIORITY is normal
2024-01-31 14:29:37,548 INFO worker.py:1724 -- Started a local Ray instance.
INFO 01-31 14:29:42 llm_engine.py:70] Initializing an LLM engine with config: model='WizardLM/WizardLM-70B-V1.0', tokenizer='WizardLM/WizardLM-70B-V1.0', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=4, quantization=None, enforce_eager=False, seed=0)
INFO 01-31 14:33:36 llm_engine.py:275] # GPU blocks: 30322, # CPU blocks: 3276
INFO 01-31 14:33:38 model_runner.py:501] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-31 14:33:38 model_runner.py:505] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
(RayWorkerVllm pid=102741) INFO 01-31 14:33:38 model_runner.py:501] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=102741) INFO 01-31 14:33:38 model_runner.py:505] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
(RayWorkerVllm pid=102741) [W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
INFO 01-31 14:34:13 model_runner.py:547] Graph capturing finished in 35 secs.
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s](RayWorkerVllm pid=102741) INFO 01-31 14:34:13 model_runner.py:547] Graph capturing finished in 35 secs.
(RayWorkerVllm pid=103012) INFO 01-31 14:33:38 model_runner.py:501] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
[repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayWorkerVllm pid=103012) INFO 01-31 14:33:38 model_runner.py:505] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. [repeated 2x across cluster]
Processed prompts: 100%|██████████| 4/4 [00:00<00:00, 6.71it/s]
Prompt: '<s><|im_start|>user\nHello, my name is<|im_end|>\n<|im_start|>assistant\n', Generated text: "Hi there! I'm here to assist you with any questions or tasks you"
Prompt: '<s><|im_start|>user\nThe president of the United States is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The president of the United States is currently Joe Biden. He was inaugurated'
Prompt: '<s><|im_start|>user\nThe capital of France is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'Paris\n<|im_end|>'
Prompt: '<s><|im_start|>user\nThe future of AI is<|im_end|>\n<|im_start|>assistant\n', Generated text: '<|im_end|>'
(RayWorkerVllm pid=103012) INFO 01-31 14:34:13 model_runner.py:547] Graph capturing finished in 35 secs. [repeated 2x across cluster]
(RayWorkerVllm pid=103012) [W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator()) [repeated 2x across cluster]
```
Specifying the download directory worked. Thanks!
@sidjha1 How and where did you specify the download directory? I don't see any argument like this for the LLM class here - https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py
Hey @TanmayParekh, I'm putting the vLLM quickstart with the …
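For reference, `download_dir` is not a named parameter of `LLM`, but in the vLLM versions referenced in this thread the constructor forwards extra keyword arguments to `EngineArgs`, which does define `download_dir`. A minimal sketch, assuming that behavior (the cache path below is a hypothetical placeholder, not from this thread):

```python
from vllm import LLM

# Sketch only: `download_dir` is assumed to be forwarded via **kwargs to
# EngineArgs. Pointing it at a per-user, writable directory keeps the weights
# cache (and the corresponding .lock file) out of the shared /tmp.
llm = LLM(
    model="WizardLM/WizardLM-70B-V1.0",
    trust_remote_code=True,
    tensor_parallel_size=4,
    download_dir="/path/to/your/own/vllm_cache",  # hypothetical per-user path
)
```

Keeping the cache per user appears to be why specifying the download directory resolved the lock error reported above.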
When multiple users are using vLLM on the same machine, we get a permission denied error on a .lock file in /tmp (see the traceback above).
This was also mentioned in #2232 and #2179.
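Based on the traceback above, the lock file name is derived from the model name and the lock directory falls back to the shared /tmp when no download directory is given. The sketch below only illustrates that inference; `get_lock_sketch` is a hypothetical stand-in, not vLLM's actual `get_lock` implementation.

```python
import os
from filelock import FileLock  # the traceback above shows the filelock package in use


def get_lock_sketch(model_name_or_path: str, cache_dir: str | None = None) -> FileLock:
    """Hypothetical reconstruction of how the lock path appears to be built."""
    lock_dir = cache_dir if cache_dir is not None else "/tmp"
    lock_name = model_name_or_path.replace("/", "-") + ".lock"
    return FileLock(os.path.join(lock_dir, lock_name))


# User A creates /tmp/WizardLM-WizardLM-70B-V1.0.lock; when user B later tries to
# acquire the same lock, os.open() fails with PermissionError (Errno 13) because
# the file is owned by user A. A per-user cache_dir avoids the collision.
lock = get_lock_sketch("WizardLM/WizardLM-70B-V1.0")
print(lock.lock_file)  # -> /tmp/WizardLM-WizardLM-70B-V1.0.lock
```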