
Tmp Directory Locked #2675

Closed
sidjha1 opened this issue Jan 30, 2024 · 5 comments

sidjha1 commented Jan 30, 2024

When multiple users are using vLLM on the same machine, we get the following permission denied error regarding a .lock file:

Permission denied: '/tmp/meta-llama-Llama-2-70b-chat-hf.lock'

This was also mentioned in #2232 and #2179.

@asimmunawar

You can add this to resolve the lock issue:
--download-dir "LOCAL-PATH"
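
For context: when no download directory is given, vLLM appears to derive the lock-file path under the world-shared /tmp from the model name alone, so every user on the machine points at the same .lock file and only the user who first created it can open it again. A rough sketch of that logic, paraphrased from the get_lock/prepare_hf_model_weights frames in the traceback below (not the verbatim vLLM source):

import os
import filelock

def get_lock(model_name_or_path, cache_dir=None):
    # With no cache/download dir, the lock lands in the shared /tmp, named only
    # after the model (e.g. /tmp/meta-llama-Llama-2-70b-chat-hf.lock) -- the
    # same path for every user on the machine.
    lock_dir = cache_dir if cache_dir else "/tmp"
    lock_file_name = model_name_or_path.replace("/", "-") + ".lock"
    return filelock.FileLock(os.path.join(lock_dir, lock_file_name))

Passing --download-dir (or download_dir= in Python) makes cache_dir non-empty, so the lock file is created inside your own directory instead of /tmp.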


Mor-Li commented Jan 31, 2024

Same problem; I think this needs attention. It seems to appear randomly!
When I run this script:

from vllm import LLM, SamplingParams
prompts = [
    "<s><|im_start|>user\nHello, my name is<|im_end|>\n<|im_start|>assistant\n",
    "<s><|im_start|>user\nThe president of the United States is<|im_end|>\n<|im_start|>assistant\n",
    "<s><|im_start|>user\nThe capital of France is<|im_end|>\n<|im_start|>assistant\n",
    "<s><|im_start|>user\nThe future of AI is<|im_end|>\n<|im_start|>assistant\n",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="WizardLM/WizardLM-70B-V1.0", trust_remote_code=True,
          tensor_parallel_size=4,)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

The output is sometimes like this:

INFO 01-31 14:22:40 llm_engine.py:70] Initializing an LLM engine with config: model='WizardLM/WizardLM-70B-V1.0', tokenizer='WizardLM/WizardLM-70B-V1.0', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=4, quantization=None, enforce_eager=False, seed=0)
Traceback (most recent call last):
  File "/mnt/hwfile/limo/opencompass_fork/configs/needleinahaystack/wizard_debug.py", line 10, in <module>
    llm = LLM(model="WizardLM/WizardLM-70B-V1.0", trust_remote_code=True,
  File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 105, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 309, in from_engine_args
    engine = cls(*engine_configs,
  File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 109, in __init__
    self._init_workers_ray(placement_group)
  File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 249, in _init_workers_ray
    self._run_workers(
  File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 795, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/worker/worker.py", line 81, in load_model
    self.model_runner.load_model()
  File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 64, in load_model
    self.model = get_model(self.model_config)
  File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/model_executor/model_loader.py", line 72, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 313, in load_weights
    for name, loaded_weight in hf_model_weights_iterator(
  File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/model_executor/weight_utils.py", line 198, in hf_model_weights_iterator
    hf_folder, hf_weights_files, use_safetensors = prepare_hf_model_weights(
  File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/vllm/model_executor/weight_utils.py", line 154, in prepare_hf_model_weights
    with get_lock(model_name_or_path, cache_dir):
  File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/filelock/_api.py", line 297, in __enter__
    self.acquire()
  File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/filelock/_api.py", line 255, in acquire
    self._acquire()
  File "/mnt/petrelfs/limo/miniconda3/envs/opencompass_fork/lib/python3.10/site-packages/filelock/_unix.py", line 39, in _acquire
    fd = os.open(self.lock_file, open_flags, self._context.mode)
PermissionError: [Errno 13] Permission denied: '/tmp/WizardLM-WizardLM-70B-V1.0.lock'

and sometimes like this:

(opencompass_fork) [limo@HOST-10-140-60-209 opencompass_fork]$ srun -p llm_dev2 --quotatype=auto --gres=gpu:4 -N1 -u python3 configs/needleinahaystack/wizard_debug.py
srun: job 3348453 queued and waiting for resources
srun: job 3348453 has been allocated resources
srun: Job 3348453 scheduled successfully!
Current QUOTA_TYPE is [reserved], which means the job has occupied quota in RESERVED_TOTAL under your partition.
Current PHX_PRIORITY is normal

2024-01-31 14:29:37,548 INFO worker.py:1724 -- Started a local Ray instance.
INFO 01-31 14:29:42 llm_engine.py:70] Initializing an LLM engine with config: model='WizardLM/WizardLM-70B-V1.0', tokenizer='WizardLM/WizardLM-70B-V1.0', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=4, quantization=None, enforce_eager=False, seed=0)
INFO 01-31 14:33:36 llm_engine.py:275] # GPU blocks: 30322, # CPU blocks: 3276
INFO 01-31 14:33:38 model_runner.py:501] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-31 14:33:38 model_runner.py:505] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
(RayWorkerVllm pid=102741) INFO 01-31 14:33:38 model_runner.py:501] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=102741) INFO 01-31 14:33:38 model_runner.py:505] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
(RayWorkerVllm pid=102741) [W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
INFO 01-31 14:34:13 model_runner.py:547] Graph capturing finished in 35 secs.
Processed prompts:   0%|          | 0/4 [00:00<?, ?it/s](RayWorkerVllm pid=102741) INFO 01-31 14:34:13 model_runner.py:547] Graph capturing finished in 35 secs.
(RayWorkerVllm pid=103012) INFO 01-31 14:33:38 model_runner.py:501] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. 
[repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayWorkerVllm pid=103012) INFO 01-31 14:33:38 model_runner.py:505] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. [repeated 2x across cluster]
Processed prompts: 100%|██████████| 4/4 [00:00<00:00,  6.71it/s]
Prompt: '<s><|im_start|>user\nHello, my name is<|im_end|>\n<|im_start|>assistant\n', Generated text: "Hi there! I'm here to assist you with any questions or tasks you"
Prompt: '<s><|im_start|>user\nThe president of the United States is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'The president of the United States is currently Joe Biden. He was inaugurated'
Prompt: '<s><|im_start|>user\nThe capital of France is<|im_end|>\n<|im_start|>assistant\n', Generated text: 'Paris\n<|im_end|>'
Prompt: '<s><|im_start|>user\nThe future of AI is<|im_end|>\n<|im_start|>assistant\n', Generated text: '<|im_end|>'
(RayWorkerVllm pid=103012) INFO 01-31 14:34:13 model_runner.py:547] Graph capturing finished in 35 secs. [repeated 2x across cluster]
(RayWorkerVllm pid=103012) [W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator()) [repeated 2x across cluster]


sidjha1 commented Feb 3, 2024

Specifying the download directory worked. Thanks!

sidjha1 closed this as completed Feb 3, 2024
@TanmayParekh

@sidjha1 How and where did you specify the download directory? I don't see any argument like this for the LLM class here - https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py


sidjha1 commented Feb 9, 2024

Hey @TanmayParekh, I'm pasting the vLLM quickstart with download_dir specified below. IIRC, the parameter is passed through **kwargs to the engine arguments, so it is not explicitly listed in the LLM signature.

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="facebook/opt-125m", download_dir="vllm-download-dir")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
