Concurrent pulls fail (race condition?) #5020
Comments
This is a known issue that is unfortunately extremely difficult to deal with, given Singularity is daemon-less and used across a very large range of network/parallel file-systems that don't necessarily support locking / atomic rename / global cache consistency. We could use some additional techniques here to make this a less frequently occurring race, but there would still be a race in various cases (e.g. across nodes using NFS, where locking and rename atomicity aren't guaranteed across clients). Because we can't perfectly prevent these races given how Singularity is deployed on clusters, it will take some co-ordination in whatever is calling Singularity (in this case Nextflow) to completely avoid failures. Arguably, a very visible failure (like the current situation) is better than something that occurs infrequently and is difficult to pin down. The general advice when running things in parallel using containers on shared filesystems is that you should pull the image to a SIF file first, and then have the parallel tasks run from that file.
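As an illustrative sketch of that "pull first" pattern (the paths and the library:// URI below are made up for the example):

```bash
# Pull the image once, up front, to a known location on shared storage.
singularity pull /shared/images/samtools_v1.9.sif library://example/default/samtools:v1.9

# The parallel tasks then run from the already-complete SIF file, so no two
# processes ever race to populate the cache.
for i in $(seq 1 8); do
    singularity exec /shared/images/samtools_v1.9.sif samtools --version &
done
wait
```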
Other options are to bypass the cache entirely, or to give each process its own cache directory.
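One way to do the per-process cache, for example, is via the `SINGULARITY_CACHEDIR` environment variable (the job-ID variable below is just an illustration of any per-job unique value):

```bash
# Give each job its own private cache so concurrent pulls cannot collide.
export SINGULARITY_CACHEDIR="/tmp/singularity-cache-${SLURM_JOB_ID:-$$}"
mkdir -p "$SINGULARITY_CACHEDIR"

singularity run library://example/default/samtools:v1.9
```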
Thanks for the fast answer! I was hoping for some fast solution in singularity, like downloading to a temporary location and moving in-place; but I see that you've already considered all the options and atomicity issues... I'll try to work around it in nextflow!
It'd be great if we could do something, and we'll keep thinking about it. There's an argument for being able to turn on some kind of 'nicer' cache behavior where you are certain you have only local filesystems. I came back to Sylabs last year from an HPC environment where Nextflow was used, and we had to work around this kind of issue there too. I can't remember if there's an easy way to set a process to only run if Singularity is used in a nextflow workflow, but I'd guess Paolo would be able to answer on the Nextflow Gitter or similar.
Our users are seeing a similar issue quite frequently. It would be great if this could be fixed. Basically, if two instances of singularity try to start the same container, one of them can fail with something like the following error message.
I understand that it's hard to do a true database-ACID grade mutex across different file systems, but I don't think it has to be perfect? I believe downloading to a temporary file and renaming it into place would be very close to atomic on most filesystems.
It is always preferred to pull the image to a SIF file first and run your parallel tasks from that. You can also set a host/process specific cache directory through the environment, or use the option to disable the cache entirely.
Right - I don't necessarily think it has to be perfect, but we need to do it in a way that doesn't make things worse for the end user. The problem with addressing that is the sheer number of file-systems to consider, and the fact that many of them may have different atomicity / consistency guarantees depending on version or site configuration. It's possible, e.g., that a sporadic failure due to a non-atomic rename race or similar might happen 20 hours into a 24 hour job, and that would result in an overall worse experience than hitting this early and often, and adopting the 'pull first' workaround.
On local filesystems this is essentially true. POSIX demands that rename is atomic on the client, but it doesn't have to be atomic across all clients of the same FS. On NFS, which is a very common shared filesystem on clusters, other clients aren't guaranteed to see a rename atomically. At least older versions (maybe current - I'm not sure) of Lustre using Distributed Namespace have a non-atomic rename, which involves a file copy (so not "very close to atomic").
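For reference, the temp-file-plus-rename pattern under discussion looks roughly like the sketch below (the URL and cache layout are placeholders, not Singularity's actual code); the final `mv` is atomic on a local POSIX filesystem, but as noted above that guarantee does not extend across NFS clients or Lustre DNE:

```bash
# Download into a temporary file in the same directory as the final target,
# so the rename stays within one filesystem.
cache_dir="${SINGULARITY_CACHEDIR:-$HOME/.singularity/cache}/library"
mkdir -p "$cache_dir"

tmp="$(mktemp "$cache_dir/.download.XXXXXX")"
curl -fsSL "https://example.org/samtools_v1.9.sif" -o "$tmp"   # placeholder URL

# On a local filesystem rename(2) replaces the target atomically: local readers
# see either no file or the complete file, never a partial one.
mv -f "$tmp" "$cache_dir/samtools_v1.9.sif"
```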
This is non-trivial for a daemon-less process like Singularity. It involves a timeout to handle situations where the process that started a download has died or hung. We will work on this, but it will take time, so that we can ensure we have a plan that makes the situation no worse for cases like those discussed above.
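Purely as an illustration of what such caller-side co-ordination with a timeout can look like (this is not Singularity's implementation, and `flock` only helps where the filesystem actually supports locking, which is exactly the caveat above):

```bash
# Serialise pulls of the same image behind an advisory lock, giving up after a
# timeout so a dead or hung lock holder cannot block the job forever.
lockdir="${SINGULARITY_CACHEDIR:-$HOME/.singularity/cache}"
mkdir -p "$lockdir"

if flock --wait 600 9; then
    singularity pull /shared/images/samtools_v1.9.sif \
        library://example/default/samtools:v1.9
else
    echo "Timed out waiting for another pull of the same image" >&2
    exit 1
fi 9>"$lockdir/.pull.lock"
```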
We can't do docker pull before we run jobs, as we don't know which container/tag our apps might end up running. They could update the tag, or change the container altogether. We could try to guess/detect it, but it would be brittle. I'd also prefer not to disable caching altogether. Singularity v3's ability to launch containers like docker run (with no wait time) is a huge benefit for us.
We've just discussed this a bit, and are looking to maybe proceed as follows:

- Add a global option to disable the cache totally (this can be added quickly).
- Rework the caching to benefit from atomic rename, and be safe on local file-systems only, where true atomic rename is possible.
- Detect network file-system types and print a warning that the cache is not safe there, and may fail.

@soichih - regarding your follow-up, this isn't likely to be a complete solution...
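As a rough shell sketch of the detection idea in the last point (Singularity itself is written in Go; the filesystem names to warn about here are illustrative, not an exhaustive list):

```bash
# Warn if the cache directory sits on a filesystem where cross-client rename
# atomicity cannot be assumed.
cache_dir="${SINGULARITY_CACHEDIR:-$HOME/.singularity/cache}"
fstype="$(stat -f -c %T "$cache_dir" 2>/dev/null)"

case "$fstype" in
    nfs|nfs4|lustre|gpfs|cifs|smb*)
        echo "WARNING: cache is on '$fstype'; concurrent pulls may not be safe" >&2
        ;;
esac
```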
I'm curious what type of jobs are running where the container may change after job submission, and that is a requirement rather than a reproducibility concern? It's honestly not something we'd anticipate coming across often, so it'd be good to understand your use case a bit more.
We can fix it for caching to local file-systems only, but there is simply no way to do it on a cluster or with shared file-systems in a daemon-less way, without interprocess communication between instances of singularity, given the lack of guarantees on shared file systems. Some of the strongest benefits of Singularity come from the fact that it is daemon-less and doesn't need complex co-ordination, so these aren't things we'd likely introduce.
There is another disadvantage of downloading directly to the final file: it leaves broken files behind if the pull fails midway for some reason (even ctrl+c, which would be preventable by catching signals), and subsequent runs then fail the hash check. Disabling caching globally is not a very attractive option for me. Even the most minimal containers can take up a lot of space when running 1000 execution instances in parallel. The nextflow issue can be hacked around with the right workflow configuration.
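For what it's worth, the temp-file approach combines naturally with a signal trap, so an interrupted download never leaves a partial file behind; a minimal sketch (placeholder URL, not Singularity's actual code):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Download to a temporary file and remove it on any exit, including Ctrl+C.
tmp="$(mktemp ./samtools_v1.9.sif.XXXXXX)"
cleanup() { rm -f "$tmp"; }
trap cleanup EXIT
trap 'cleanup; exit 130' INT TERM

curl -fsSL "https://example.org/samtools_v1.9.sif" -o "$tmp"   # placeholder URL

# Atomic on a local filesystem; after this, cleanup's rm is a harmless no-op.
mv -f "$tmp" samtools_v1.9.sif
```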
Not sure how the current behavior helps there - if my first parallel step that pulls images starts after those 20 hours, it'll crash after those 20 hours and not before. Also, we're more like "crash 10 days into a 14 day job" here ;-)
Agreed - signal handling has to be tidied up in various places also, but rename will help here.
Yeah - I'm aware it's only less disruptive if the first parallel step is early and it fails there. There are lots of situations here, and given there is no magic fix we can make generically on non-local filesystems, there are going to be cases which are nasty, but we can try to warn ahead of time that they might occur.
If you are feeling adventurous, you can try the current development code. This work is ongoing.
Awesome! Stupid work has me on something else at the moment, but I'll get back to it soon.
Closing, as this is believed to be fixed by the cache rework described above.
Version of Singularity:
What version of Singularity are you using? Run `singularity version` to check.
Expected behavior
Actual behavior
I think what happens is that the first run creates the cache directory and starts downloading directly to `cache/library/sha256.721a1b94e24ff67342f7531b240fca6d527fe4e25ebb40eef658ea1557a30ba1/samtools_v1.9.sif`. The second run gets the image hash, sees that the cache directory already exists, and calculates the hash from the `samtools_v1.9.sif` it finds there. But this file is still incomplete, so the hash check fails.

Normally I wouldn't consider this a drastic problem, but I'm using singularity in nextflow pipelines. If I use the same image in multiple processes that are started in parallel, the pipeline fails when starting for the first time (possibly the first few times, if multiple images are used in parallel processes).
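For illustration only, a race of this shape can typically be provoked from the shell along these lines (the library:// URI is a made-up example, and the failure applies to affected versions):

```bash
# Start from an empty cache so that both pulls race to populate it.
export SINGULARITY_CACHEDIR="$(mktemp -d)"

# Two concurrent pulls of the same image into the same cold cache; on affected
# versions one may see the other's partial download and fail its hash check.
singularity pull samtools_a.sif library://example/default/samtools:v1.9 &
singularity pull samtools_b.sif library://example/default/samtools:v1.9 &
wait
```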
Steps to reproduce this behavior
How did you install Singularity
source