
[Bugfix] use SoftFileLock instead of FileLock #3578

Merged 1 commit into vllm-project:main on Mar 23, 2024

Conversation

@kota-iizuka (Contributor) commented Mar 23, 2024

filelock.FileLock does not delete the .lock file it creates. Use filelock.SoftFileLock instead, which removes the .lock file when it is no longer needed. This prevents another user from referencing a stale .lock file and hitting an error.

FIX #2232 #2179 #2675

Sample code

The following script creates a.lock, holds it for 5 seconds, and deletes it again when execution finishes.

Changing SoftFileLock to FileLock leaves the a.lock file behind on exit:

import filelock
import time

# a.lock is created on acquire and deleted again on release
with filelock.SoftFileLock("a.lock"):
    time.sleep(5)

@simon-mo (Collaborator) left a comment:

nice catch!

@simon-mo merged commit 743a0b7 into vllm-project:main on Mar 23, 2024 (32 checks passed)
@youkaichao (Member) commented:
What if the program crashes and the .lock file is not deleted? The program would deadlock and could not be started again :(

@Yard1 (Collaborator) commented Mar 23, 2024

I don't quite understand this change: with a FileLock, the mere existence of the file does not indicate that the lock is held (it uses OS-level locking). SoftFileLock is dangerous for the reasons @youkaichao pointed out. Cc @simon-mo

The error seems to be caused by permissions (the lock file is only accessible by the user who created it). I think there should be a setting in FileLock to change that; if not, it should be quite easy to add this logic in a wrapper. Or just scope the lock file to the user's home directory.
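A minimal sketch of the home-directory option, assuming a hypothetical helper (get_user_scoped_lock and the cache path below are illustrative, not vLLM's actual API):

import os
from pathlib import Path

from filelock import FileLock

def get_user_scoped_lock(model_name_or_path: str) -> FileLock:
    # Hypothetical helper: keep the lock file inside the current user's
    # home directory so two users never contend on the same lock file.
    lock_dir = Path.home() / ".cache" / "vllm" / "locks"
    lock_dir.mkdir(parents=True, exist_ok=True)
    lock_name = model_name_or_path.replace("/", "-") + ".lock"
    return FileLock(os.fspath(lock_dir / lock_name))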

@simon-mo (Collaborator) commented:
I mean, semantically they are the same, except SoftFileLock gives you a nicer cleanup?

@youkaichao (Member) commented Mar 23, 2024

FileLock: the file may already exist; multiple processes open a file descriptor to it, and the OS grants the lock to one process at a time. Therefore the lock cannot be shared by two processes owned by two different users (the permission issue above). After the process releases the lock, the OS simply releases the descriptor and leaves the file behind.

SoftFileLock: locks on the file's existence. It requires the file to be absent before acquiring the lock, and it must delete the file after releasing the lock. If either requirement is broken (e.g. by a program crash), a deadlock happens: the file is still there, so later processes cannot get the lock.
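A small sketch illustrating the FileLock semantics just described: a leftover lock file does not block acquisition, because the lock lives in the OS, not in the file's existence.

import pathlib

import filelock

# Simulate a stale lock file left behind by a crashed process.
pathlib.Path("a.lock").touch()

# FileLock still acquires: on POSIX the lock is taken via fcntl/flock on
# the open descriptor, so the file's mere existence does not mean locked.
with filelock.FileLock("a.lock", timeout=1):
    print("acquired despite the pre-existing a.lock")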

In my opinion, a better way to allow multiple users would be #3587: when each user launches vllm, the files they create can be prefixed with a launch_id, so different users do not interfere with each other.
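A rough sketch of that launch-id idea (VLLM_LAUNCH_ID and launch_scoped_path are illustrative names, not necessarily what #3587 proposes):

import os
import uuid

# One id per vLLM launch; child processes inherit it via the environment.
LAUNCH_ID = os.environ.setdefault("VLLM_LAUNCH_ID", uuid.uuid4().hex)

def launch_scoped_path(name: str) -> str:
    # Prefix per-launch files so different users/launches never collide.
    return f"{LAUNCH_ID}_{name}"

print(launch_scoped_path("a.lock"))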

@kota-iizuka (Contributor, Author) commented Mar 24, 2024

What if the program crashes and the .lock file is not deleted? It will cause the program to be in deadlock, and cannot be started again :(

I had overlooked that. By using the SoftFileLock(timeout=0) option, the program can exit immediately with an error when the lock file already exists. I will resubmit a fix.
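For reference, lock_sample_with_timeout.py from the transcripts below is just the earlier sample with timeout=0 added (reconstructed from the traceback):

import filelock
import time

# With a stale a.lock already present, timeout=0 raises filelock.Timeout
# immediately instead of polling forever.
with filelock.SoftFileLock("a.lock", timeout=0):
    time.sleep(5)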

$ python lock_sample.py 
^CTraceback (most recent call last):
  File "/work/lock_sample.py", line 4, in <module>
    with filelock.SoftFileLock("a.lock"):
  File "/work/.venv/lib/python3.10/site-packages/filelock/_api.py", line 297, in __enter__
    self.acquire()
  File "/work/.venv/lib/python3.10/site-packages/filelock/_api.py", line 267, in acquire
    time.sleep(poll_interval)
KeyboardInterrupt

$ python lock_sample_with_timeout.py 
Traceback (most recent call last):
  File "/work/lock_sample_with_timeout.py", line 4, in <module>
    with filelock.SoftFileLock("a.lock", timeout=0):
  File "/work/.venv/lib/python3.10/site-packages/filelock/_api.py", line 297, in __enter__
    self.acquire()
  File "/work/.venv/lib/python3.10/site-packages/filelock/_api.py", line 264, in acquire
    raise Timeout(lock_filename)  # noqa: TRY301
filelock._error.Timeout: The file lock 'a.lock' could not be acquired.

@youkaichao (Member) commented:
Not sure if the lock is taken per-process or per-vllm-launch. @simon-mo can confirm this.

If SoftFileLock(timeout=0) is called by multiple processes in one vllm launch, it will immediately crash, because only one process can get the lock while all the other processes get an error.
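A sketch of that failure mode, assuming two worker processes in one launch race for the same lock:

import multiprocessing
import time

import filelock

def worker(name: str) -> None:
    try:
        # timeout=0: fail immediately instead of waiting for the lock.
        with filelock.SoftFileLock("a.lock", timeout=0):
            print(f"{name}: got the lock")
            time.sleep(1)
    except filelock.Timeout:
        print(f"{name}: Timeout, another process holds the lock")

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=worker, args=(f"p{i}",))
             for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()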

@kota-iizuka (Contributor, Author) commented:
called by multiple processes in one vllm launch

I see your point. That means there is a legitimate case where get_lock() is called multiple times by the same user.

You should be able to avoid conflicts by renaming the lock file to {model_name}_{process_id}.lock.

@youkaichao (Member) commented:
{model_name}_{process_id}.lock is not good. Different processes would lock on their own files, all of them would think they hold the lock, and then they would modify the directory at the same time ...

@kota-iizuka (Contributor, Author) commented:
In summary, I think the current implementation using SoftFileLock is OK. As pointed out, there are cases where the program exits with the .lock file still present, but this should only happen during development, in which case you can simply delete the .lock file manually. That is a limited problem compared to multiple users sharing vllm on the same server.

@youkaichao (Member) commented:
I suggest putting the .lock file in /tmp, which is cleaned up by the OS. That would lead to fewer deadlocks (a sketch follows below).

cc @simon-mo
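A sketch of that suggestion (not the merged code; get_lock is the helper name mentioned earlier in this thread):

import os
import tempfile

from filelock import FileLock

def get_lock(model_name_or_path: str) -> FileLock:
    # Place the lock under the OS temp dir so stale lock files are
    # eventually cleaned up by the system instead of lingering forever.
    lock_name = model_name_or_path.replace("/", "-") + ".lock"
    return FileLock(os.path.join(tempfile.gettempdir(), lock_name))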

@WoosukKwon (Collaborator) commented:
I got a deadlock on the main branch after this PR was merged. The bug didn't appear in the commit just before this PR.

@kota-iizuka (Contributor, Author) commented Mar 24, 2024

@WoosukKwon In environments where the previous code was executed, a leftover .lock file will cause a deadlock. Please delete the .lock file and try again.

If the error occurs in a fresh environment, I would like information on how to reproduce it.

@kota-iizuka (Contributor, Author) commented:
It may be better to use filelock.FileLock(mode=0o666), which also works in the situation where a .lock file already exists.
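A sketch of that mode-based variant (note the octal literal 0o666; a bare 666 would be a different number in Python):

from filelock import FileLock

# 0o666 makes the lock file readable and writable by all users, so a
# second user can acquire the same FileLock instead of hitting a
# permission error. The process umask may still mask these bits.
with FileLock("a.lock", mode=0o666):
    pass  # critical section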

@WoosukKwon (Collaborator) commented:
@kota-iizuka Which file should I exactly delete?

Also, I believe we shouldn't break any users who have already been using vLLM.

@kota-iizuka (Contributor, Author) commented Mar 24, 2024

@WoosukKwon

Which file should I exactly delete?

The file named model_name_or_path.replace("/", "-") + ".lock", as written in the code.

Also, I believe we shouldn't break any users who have already been using vLLM.

My interpretation is that the previous behavior was already broken.

@youkaichao (Member) commented Mar 24, 2024

I got a deadlock on the main branch after this PR is merged. The bug didn't appear in the commit just before this PR.

I realized this PR will break all existing vllm users. Because we used FileLock before, all users will already have .lock files. If we then change to SoftFileLock, which requires the absence of the .lock file, it will deadlock for all of them.

@kota-iizuka (Contributor, Author) commented Mar 25, 2024

I realized this PR will break all existing vllm users. Because we use FileLock before, all users will have .lock files already. Then if we change to SoftFileLock, it requires the absence of .lock file, so it will deadlock for all users.

Yes, but with the previous code the .lock file also had to be deleted every time another user used the model. In that sense, the conditions are the same.

youkaichao added a commit that referenced this pull request Mar 25, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024
Successfully merging this pull request may close these issues.

The same model cannot be loaded by two different users