
Concurrent pulls fail (race condition?) #5020

Closed
h3kker opened this issue Feb 5, 2020 · 13 comments · Fixed by #5088

Comments


h3kker commented Feb 5, 2020

Version of Singularity

$ singularity version
singularity version 3.5.2+266-gd3586d1de
# current master

Expected behavior

singularity run library://pipeline/samtools:v1.9 --help
# help output
# concurrently in a second shell
singularity run library://pipeline/samtools:v1.9 --help
# help output

Actual behavior

# second concurrently started run:
FATAL:   Unable to handle library://pipeline/samtools:v1.9 uri: unable to check if /.../.singularity/cache/library/sha256.721a1b94e24ff67342f7531b240fca6d527fe4e25ebb40eef658ea1557a30ba1/samtools_v1.9.sif exists: hash does not match

I think what happens is that the first run creates the cache directory and starts downloading directly to cache/library/sha256.721a1b94e24ff67342f7531b240fca6d527fe4e25ebb40eef658ea1557a30ba1/samtools_v1.9.sif. The second run gets the image hash, sees that the cache directory already exists and calculates the hash from samtools_v1.9.sif that it finds there. But this is still incomplete, so the hash check fails.
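A rough way to confirm this by hand (assuming the default cache location under ~/.singularity/cache and the layout from the error above) is to compare the file's sha256 with the digest encoded in its directory name:

# compare the digest of the (possibly partial) cached file with the digest in its directory name
cache_entry=~/.singularity/cache/library/sha256.721a1b94e24ff67342f7531b240fca6d527fe4e25ebb40eef658ea1557a30ba1
expected=$(basename "$cache_entry" | cut -d. -f2)
actual=$(sha256sum "$cache_entry/samtools_v1.9.sif" | cut -d' ' -f1)
[ "$expected" = "$actual" ] && echo "cache entry complete" || echo "hash mismatch (partial download?)"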

Normally I wouldn't consider this a drastic problem, but I'm using singularity in Nextflow pipelines. If I use the same image in multiple processes that are started in parallel, the pipeline fails the first time it is run (possibly the first few times, if several images are each used by parallel processes).

Steps to reproduce this behavior

  • open two terminals
  • prepare two singularity run library://.... commands with the same image URI
  • start the first command
  • while the first is still downloading, start the second

How did you install Singularity

source

Contributor

dtrudg commented Feb 5, 2020

This is a known issue that is unfortunately extremely difficult to deal with given Singularity is daemon-less and used across a very large range of network/parallel file-systems that don't necessarily support locking / atomic rename / global cache consistency.

We could use some additional techniques here to make the race less frequent, but there would still be a race in various cases (e.g. across nodes using NFS without the actimeo=0 / noac mount options, etc.).

Because we can't perfectly prevent these races given how Singularity is deployed on clusters, it will take some co-ordination in whatever is calling Singularity (in this case Nextflow) to completely avoid failures. Arguably a very visible failure (like the current situation) is better than something that occurs infrequently and is difficult to pin down.

The general advice when running containers in parallel on shared filesystems is to pull the containers once, before entering the parallel portion of the run. In Nextflow, without automatic support for this, you could add a task at the start of the workflow that runs singularity pull for each image, so the containers are pre-cached.
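For example, a pre-pull step could look roughly like this (the image URI and target directory are just placeholders):

# run once, before the parallel stage
mkdir -p "$HOME/containers"
singularity pull "$HOME/containers/samtools_v1.9.sif" library://pipeline/samtools:v1.9

# parallel tasks then run the local SIF instead of the library:// URI
singularity run "$HOME/containers/samtools_v1.9.sif" --help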

Contributor

dtrudg commented Feb 5, 2020

Other options are to bypass the cache with --disable-cache, or to set per-process / per-node cache location environment variables, etc.
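For example (the temporary directory here is only an illustration):

# bypass the cache entirely for this run
singularity run --disable-cache library://pipeline/samtools:v1.9 --help

# or point each process/node at its own cache location
export SINGULARITY_CACHEDIR=$(mktemp -d /tmp/singularity-cache-XXXXXX)
singularity run library://pipeline/samtools:v1.9 --help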

Author

h3kker commented Feb 5, 2020

Thanks for the fast answer! I was hoping for some fast solution in singularity, like downloading to a temporary location and moving in-place; but I see that you've already considered all the options and atomicity issues...

I'll try to work around it in nextflow!

Contributor

dtrudg commented Feb 5, 2020

Thanks for the fast answer! I was hoping for some fast solution in singularity, like downloading to a temporary location and moving in-place; but I see that you've already considered all the options and atomicity issues...

I'll try to work around it in nextflow!

It'd be great if we could do something, and we'll keep thinking about it. There's an argument for being able to turn on some kind of 'nicer' cache behavior where you are certain you only have a local $HOME, or similar, that supports locks / atomic rename / whatever solution we develop (not sure we can detect this reliably).

I came back to Sylabs last year from an HPC environment where Nextflow was used, and we had local flock on lustre, plus NFS, Panasas pNFS, and GPFS in use - so am unfortunately well aware of where this complexity hits.

I can't remember if there's an easy way to set a process to only run if Singularity is used in a nextflow workflow, but I'd guess Paolo would be able to answer on the Nextflow Gitter or similar.

Contributor

soichih commented Feb 6, 2020

Our users are seeing a similar issue quite frequently. It would be great if this can be fixed.

Basically, if two instances of singularity try to start the same container, one of them can fail with something like the following error message.

Creating SIF file...
could not open image /N/dc2/scratch/svincibo/singularity-cache/cache/oci-tmp/34f96d505677bb18d831dbc2baae1986de1a8905ec6fadf998c1e2871f0ed741/freesurfer_on_mcr_6.0.2.sif: image format not recognized

I understand that it's hard to do a true database-ACID grade mutex across different file systems, but I don't think it has to be perfect? I believe the mv command on most file systems is very close to atomic, so you could simply cache to a temp file and mv it to the final SIF file name? Other instances could detect the .tmp file's existence and just sit and wait for the real .sif file to emerge?
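That is, something like this (the variables and the download command are placeholders, not what singularity actually does internally):

tmp=$(mktemp "$cache_dir/samtools_v1.9.sif.XXXXXX")   # temp file in the same directory, so mv never crosses filesystems
curl -sSL "$image_url" -o "$tmp"                       # download into the temp file
mv "$tmp" "$cache_dir/samtools_v1.9.sif"               # rename into place; atomic on local filesystems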

Contributor

dtrudg commented Feb 6, 2020

It is always preferred to singularity pull once and then run the parallel step against the SIF image. Generally this is a simple workaround that additionally avoids redundant bandwidth usage and time spent building a SIF.

You can also set a host/process specific cache directory through the environment, or use the --disable-cache option to Singularity to avoid caching altogether.

I understand that it's hard to do a true database-ACID grade mutex across different file systems, but I don't think it has to be perfect?

Right - I don't necessarily think it has to be perfect, but we need to do it in a way that doesn't make things worse for the end user. The problem with addressing that is the sheer number of file-systems to consider, and the fact that many of them may have different atomicity / consistency guarantees based on version or site configuration.

It's possible, e.g., that a sporadic failure due to a non-atomic rename race or similar might happen 20 hours into a 24hr job, and that would result overall in a worse experience than hitting this early/often, and adopting the 'pull first' workaround.

I believe the mv command on most file systems is very close to atomic, so you could simply cache to a temp file and mv it to the final SIF file name?

On local filesystems this is essentially true. POSIX demands rename is atomic on the client. It doesn't have to be atomic across all clients of the same FS.

On NFS, which is a very common $HOME location for Singularity, it is not true. E.g. NFS can have very large default attribute cache timeouts, leading to files being invisible in an ls on node B many seconds after they were created on node A. There are some workarounds that make things better, but don't fix it entirely, such as forcing stat operations. These only partially address the issue, and only for NFS.
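For reference, the attribute-cache behaviour mentioned above is controlled by NFS client mount options; the server and mount point below are placeholders:

# disable NFS attribute caching, or set the attribute cache timeout to zero
mount -o noac       nfsserver:/export/home /home
mount -o actimeo=0  nfsserver:/export/home /home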

At least older versions (maybe current - I'm not sure) of Lustre using Distributed Namespace have non-atomic rename, which involves a file copy (so not "very close to atomic").

Other instances could detect the .tmp file's existence and just sit and wait for the real .sif file to emerge?

This is non-trivial for a daemon-less process like Singularity. It involves a timeout to handle situations where the pulling process crashes. But what is an acceptable timeout etc?
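Purely to illustrate the kind of wait-and-timeout logic that would be needed (the .tmp naming and the 5-minute limit are hypothetical):

# hypothetical waiter: poll while another process's temp file exists, give up after 5 minutes
deadline=$(( $(date +%s) + 300 ))
while [ -e "$cache_dir/samtools_v1.9.sif.tmp" ]; do
  [ "$(date +%s)" -ge "$deadline" ] && { echo "gave up waiting for concurrent pull" >&2; exit 1; }
  sleep 5
done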

We will work on this, but it will take time, so we can ensure we have a plan that makes the situation no worse for cases like the 'sporadic failure 20 hours into a job' one I mentioned. I'm wondering if it would be useful in the near term to have an option to disable the cache globally in singularity.conf - though this is a blunt instrument?

Contributor

soichih commented Feb 6, 2020

We can't do docker pull before we run jobs as we don't know which container/tag our Apps might end up running. They could update the tag, or change the container altogether. We could try to guess/detect it, but it would be brittle.

I'd like to not disable caching altogether. Singularity v3's ability to launch containers like docker run (no wait time) is a huge benefit for us.

Contributor

dtrudg commented Feb 6, 2020

We've just discussed this a bit, and are looking to maybe proceed as follows:

Add a global option to disable cache totally (this can be added quickly).

Rework the caching to use atomic rename, so it is safe on local file-systems where true atomic rename is possible. Detect network file-system types and print a warning that the cache is not safe there and may fail.
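As a rough sketch of the detection part (the stat invocation and the list of 'network' filesystem types below are just examples, not the final implementation):

# warn if the cache directory appears to live on a network filesystem
fstype=$(stat -f -c %T "${SINGULARITY_CACHEDIR:-$HOME/.singularity/cache}")
case "$fstype" in
  nfs|nfs4|lustre|gpfs|panfs)
    echo "WARNING: cache is on $fstype; concurrent pulls may still race" >&2 ;;
esac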

@soichih - regarding your follow-up, this isn't likely to be a complete solution...

We can't do docker pull before we run jobs as we don't know which container/tag our Apps might end up running. They could update the tag, or change the container altogether. We could try to guess/detect it, but it would be brittle.

I'm curious what type of jobs are running where the container may change after job submission, and that is a requirement rather than a reproducibility concern? It's honestly not something we'd anticipate coming across often, so it'd be good to understand your use case a bit more.

I'd like to not disable caching altogether. Singularity v3's ability to launch containers like docker run (no wait time) is a huge benefit for us.

We can fix it for caching to local file-systems only, but there is simply no way to do it on a cluster with shared file-systems without a daemon or interprocess communication between instances of singularity, given the lack of guarantees on shared file systems. Some of the strongest benefits of Singularity come from the fact that it is daemon-less and doesn't need complex co-ordination, so these aren't things we'd be likely to introduce.

Author

h3kker commented Feb 7, 2020

There is another disadvantage of downloading directly to the final file: it leaves broken files if the pull fails midway for some reason (even ctrl+c, which would be preventable by catching signals). Subsequent runs also fail the hash check.
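For what it's worth, one way to recover from such a broken entry, assuming the default cache location, is to remove it by hand (on recent versions singularity cache clean should also work, though it wipes everything cached):

# remove just the broken library cache entry
rm -rf ~/.singularity/cache/library/sha256.721a1b94e24ff67342f7531b240fca6d527fe4e25ebb40eef658ea1557a30ba1
# or wipe the whole cache
singularity cache clean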

Disabling caching globally is not a very attractive option for me. Even the most minimal containers take up a lot of space when 1000 execution instances run in parallel.

The Nextflow issue can be hacked around by setting errorStrategy to retry with a reasonable maxRetries setting, btw, because by the time the other instances fail and retry, the download may have finished.

It's possible, e.g., that a sporadic failure due to a non-atomic rename race or similar might happen 20 hours into a 24hr job, and that would result overall in a worse experience than hitting this early/often, and adopting the 'pull first' workaround.

Not sure how the current behavior helps there - if my first parallel processing with pulling images starts after those 20hrs, it'll crash after those 20hrs and not before. Also, we're more like "crash 10 days into a 14 day job" here ;-)

Contributor

dtrudg commented Feb 7, 2020

There is another disadvantage of downloading directly to the final file: it leaves broken files if the pull fails midway for some reason (even ctrl+c, which would be preventable by catching signals). Subsequent runs also fail the hash check.

Agreed - signal handling has to be tidied up in various places also, but rename will help here.

It's possible, e.g., that a sporadic failure due to a non-atomic rename race or similar might happen 20 hours into a 24hr job, and that would result overall in a worse experience than hitting this early/often, and adopting the 'pull first' workaround.

Not sure how the current behavior helps there - if my first parallel processing with pulling images starts after those 20hrs, it'll crash after those 20hrs and not before. Also, we're more like "crash 10 days into a 14 day job" here ;-)

Yeah - I'm aware it's only less disruptive if the first parallel step comes early and fails there. There are lots of situations here, and given there is no magic fix we can apply generically on non-local filesystems, there are going to be some nasty cases - but we can try to warn ahead of time that they might occur.

Contributor

dtrudg commented Mar 6, 2020

If you are feeling adventurous you can try the current master branch, which has the first stage of a big rework to address the cache race issue, and the inconsistencies in the way the different clients (OCI, library, shub, net sources) handled (or didn't handle) caching, cleanup, etc.

This work is ongoing in master - there are some known regressions in the issues list that will be addressed over the next week or so.

Author

h3kker commented Mar 9, 2020

Awesome! Stupid work has me on something else at the moment, but I'll get back to it soon.

Contributor

dtrudg commented Apr 27, 2020

Closing as this is believed to be fixed in master / the forthcoming 3.6.
