Singularity containers #61

Closed · stefanches7 opened this issue Mar 27, 2020 · 13 comments · Fixed by #65

stefanches7 commented Mar 27, 2020

Since academic clusters often ban Docker because it requires root privileges, would it be possible to use the current workflow with Singularity? Maybe just convert the existing Docker steps to Singularity containers (e.g. with https://github.com/singularityhub/docker2singularity)?

@replikation (Owner)

  • yes, we will implement this as a separate profile; it should be quite easy to do via Nextflow
  • I added this issue to our to-do list
  • thanks for the suggestion

hoelzer (Collaborator) commented Mar 27, 2020

Thanks for the suggestion.

Actually, depending on the configuration of your HPC, you can already use WtP with Singularity support. I was able to run the pipeline on an LSF system with Singularity. Nextflow handles the conversion of images from Docker Hub into Singularity images automatically.

Currently we only have a profile for LSF that works with Singularity; it needs some additional parameters depending on your HPC structure:

nextflow run phage.nf --fasta your.fasta --workdir ${WORK} --databases ${DB} --cachedir ${SINGULARITY} -profile lsf 
  • --workdir defines the path where nextflow writes tmp files
  • --databases defines the path where databases are stored
  • --cachedir defines the path where images (singularity) are cached

Depending on your HPC scheduler (LSF, SLURM, ...) you can use the lsf.config as a template and for example activate SLURM instead of LSF.

executor {
    name = "slurm"
    queueSize = 200
}

In the future we will use Nextflow's functionality for merging different profiles to make this easier, and we will likely also add a dedicated Singularity profile.
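
To illustrate the idea (just a sketch; the profile names and options below are illustrative, not the final implementation), the executor and the container engine could then live in separate profiles:

profiles {
    slurm       { includeConfig 'configs/slurm.config' }
    singularity { singularity.enabled = true }
}

and be combined on the command line via -profile slurm,singularity.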

@stefanches7 (Author)

Thanks! Seems like a good alternative.
We are using SLURM in the lab. When I try to run with the configuration you recommended, I get the following error:

Error executing process > 'ppr_dependecies:ppr_download_dependencies'

Caused by:
  Failed to submit process to grid scheduler for execution

Command executed:

  bsub

Command exit status:
  1

Command output:
  (more omitted..)
      [[ "$pid" ]] && nxf_kill $pid
  }

  nxf_launch() {
      /bin/bash -ue .command.sh
  }

  nxf_stage() {
      true
  }

  nxf_unstage() {
      true
      [[ ${nxf_main_ret:=0} != 0 ]] && return
  }

  nxf_main() {
      trap on_exit EXIT
      trap on_term TERM INT USR1 USR2

      [[ "${NXF_CHDIR:-}" ]] && cd "$NXF_CHDIR"
      NXF_SCRATCH=''
      [[ $NXF_DEBUG > 0 ]] && nxf_env
      touch .command.begin
      set +u
      set -u
      export PATH="/data/nasif12/home_if12/dvoretsk/projects/What_the_Phage/bin:$PATH"
      [[ $NXF_SCRATCH ]] && echo "nxf-scratch-dir $HOSTNAME:$NXF_SCRATCH" && cd $NXF_SCRATCH
      nxf_stage

      set +e
      local ctmp=$(set +u; nxf_mktemp /dev/shm 2>/dev/null || nxf_mktemp $TMPDIR)
      local cout=$ctmp/.command.out; mkfifo $cout
      local cerr=$ctmp/.command.err; mkfifo $cerr
      tee .command.out < $cout &
      tee1=$!
      tee .command.err < $cerr >&2 &
      tee2=$!
      ( nxf_launch ) >$cout 2>$cerr &
      pid=$!
      wait $pid || nxf_main_ret=$?
      wait $tee1 $tee2
      nxf_unstage
  }

  $NXF_ENTRY
  "'
  and the output was:
  'sbatch: error: Script arguments not permitted with --wrap option
  '

Apparently there is some error with the SLURM binding. Any idea what it could be?

hoelzer (Collaborator) commented Mar 27, 2020

It looks to me like LSF is still being used as the scheduler instead of SLURM. Please try

cp configs/lsf.config configs/slurm.config

and change

executor {
    name = "lsf"
    queueSize = 200
}

to

executor {
    name = "slurm"
    queueSize = 200
}

and then add the new configuration to nextflow.config:

 slurm {
        params.cloudProcess = true
        includeConfig 'configs/slurm.config'
 }

and then try again with -profile slurm and adding --workdir, --databases, and --cachedir.
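
For reference, assuming the same paths as in the LSF example above, the full call would then look something like:

nextflow run phage.nf --fasta your.fasta --workdir ${WORK} --databases ${DB} --cachedir ${SINGULARITY} -profile slurm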

hoelzer added a commit that referenced this issue Mar 27, 2020
hoelzer (Collaborator) commented Mar 27, 2020

I also added this code to a slurm branch that you can access via

git checkout slurm

Unfortunately, I cannot test this because I don't have access to a SLURM cluster with Singularity at the moment. I would therefore highly appreciate it if you could report back whether this works.

@stefanches7 (Author)

Thank you, @hoelzer! It seems to work on SLURM now; however, I get the following:

Command error:
  ERROR  : Unknown image format/type: /s/project/phagehost/c/nanozoo-basics-1.0--962b907.img
  ABORT  : Retval = 255
  

Although I've seen "Convert SIF to Sandbox" messages in the info output before, it seems that Singularity still tries to run the .img images directly. The command I use is:

./nextflow ../What_the_Phage/phage.nf --fasta /s/project/phagehost/data/all_contigs.fa --workdir /s/project/phagehost/workdir --databases /s/project/phagehost/dbs --cachedir /s/project/phagehost/c -profile slurm --mp --vb

hoelzer (Collaborator) commented Apr 6, 2020

@stefanches7 okay, great and thanks for the feedback!

Can you run

singularity shell /s/project/phagehost/c/nanozoo-basics-1.0--962b907.img

without any problems? You should see an output like:

(base) [mhoelzer@noah-login-01 ~]$ singularity shell /hps/nobackup2/singularity/mhoelzer/nanozoo-basics-1.0--da8477a.img
Singularity: Invoking an interactive shell within container...

Singularity nanozoo-basics-1.0--da8477a.img:~>

What singularity version is installed on your SLURM cluster?

singularity --version

@stefanches7 (Author)

@hoelzer thanks for the tips on where to start. The right configuration (e.g. enabling Singularity instead of Docker) solved the above-mentioned problem.
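
Roughly, the relevant bit is something along these lines in the profile config (only a sketch using the standard Nextflow docker/singularity scopes; exact values will differ per cluster, and params.cachedir here just stands for whatever --cachedir points to):

docker.enabled = false
singularity {
    enabled    = true
    autoMounts = true
    cacheDir   = params.cachedir   // assumed: the path passed via --cachedir
}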

However, there is some bad news regarding the original issue:

Caused by:
  Failed to pull singularity image
  command: singularity pull  --name multifractal-deepvirfinder-0.1.img docker://multifractal/deepvirfinder:0.1 > /dev/null
  status : 255
  message:
    INFO:    Converting OCI blobs to SIF format
    INFO:    Starting build...
    Getting image source signatures
    Copying blob sha256:1ab2bdfe97783562315f98f94c0769b1897a05f7b0395ca1520ebee08666703b
    Copying blob sha256:dd7d28bd8be53eeb346a7895aa923cc5fd8707cd893b5a96f6de37d0473431f8
    Copying blob sha256:af998e3a361bf15f05a9cf4686abe0bc34affbb1bb1d541f76553c5842e6e4fb
    Copying blob sha256:6be933a0087233e57da44b1280a6c8893b8d6420b9f1c30c55f7ed6802890991
    Copying blob sha256:3fff041945a0630cad87f4537423376297b4ca2c9ccec89e3b95993510407204
    Copying blob sha256:c76e058ed3a85d77e1837bbceaa2a9e1ee631c71fef94c26dd5c025b03c575f0
    Copying blob sha256:60405d166660bd2dd51e66387361dc0b7b8564300cd6d7ef069ed8c8a37b05ca
    Copying blob sha256:9d264a0e456bf7020807a53c41749d53c76f8d98827a3dc7058eb02a2dcda45f
    Copying config sha256:42368a0e7d7541b5f807cd7543963f6830743f3c351f6cd549033eb05edf7f3f
    Writing manifest to image destination
    Storing signatures
    FATAL:   While making image from oci registry: while building SIF from layers: conveyor failed to get: no descriptor found for reference "124671135cee740da1b70560958478a8aa2d76076271670f0f9efebc70c5d426"

I am not yet sure what the cause is; however, it could be explained by the so-called "concurrent pull", which is currently not supported by Singularity: https://github.com/sylabs/singularity/issues/5020.
That shouldn't depend on the executor, though, so it would be interesting to know: did something like this occur during multi-node LSF execution?

hoelzer (Collaborator) commented Apr 9, 2020

@stefanches7 Okay, we are getting there.

Are these variables set for you or are they empty?

export SINGULARITY_LOCALCACHEDIR=/gpfs/scratch/[username]
export SINGULARITY_CACHEDIR=/gpfs/scratch/[username]
export SINGULARITY_TMPDIR=/gpfs/scratch/[username]

In my experience, it might help to set these variables according to your HPC configuration to point to directories where you have write permission and enough disk space.

Maybe then just try the command outside of the Nextflow environment first and see if that works:

singularity pull --name multifractal-deepvirfinder-0.1.img docker://multifractal/deepvirfinder:0.1

This was working for me on LSF with Singularity v2.6.0-dist.

For example, my configuration is:

(base) [mhoelzer@noah-login-01 ~]$ echo $SINGULARITY_CACHEDIR 
/hps/nobackup2/production/metagenomics/mhoelzer
(base) [mhoelzer@noah-login-01 ~]$ echo $SINGULARITY_LOCALCACHEDIR 
/scratch

So the Singularity image is then stored at

/hps/nobackup2/production/metagenomics/mhoelzer/multifractal-deepvirfinder-0.1.img

@stefanches7 (Author)

@hoelzer yeah, I've also specified the Singularity environment variables, and a standalone pull does work.
I might pull all the needed containers standalone and then run the pipeline on the cached images. That's a temporary workaround, of course.
If we think of a more lasting solution and assume that the problem is the concurrent pull, I notice there are some withLabel directives specifying multiple CPUs for some programs; maybe there is a way to pull a container on just one CPU and then run it on multiple?

hoelzer (Collaborator) commented Apr 9, 2020

Yeah, that might be a workaround for now: pull the images manually and then run the pipeline, pointing --cachedir to the directory with the image files.

I don't think that the multiple CPUs that are specified for certain processes are an issue. They just tell SLURM (or LSF, ...) how many CPUs are requested when submitting the corresponding job. I am not entirely sure whether:

  1. the job is submitted to a compute node (with the requested CPUs and RAM) and the image is then pulled from the compute node

or

  2. the image is pulled from the node where the pipeline is executed, and once the image is available the job is submitted (with the requested CPUs and RAM)

If you want, you can easily test your hypothesis by setting all CPU values in the SLURM config file to 1.
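
Just as a sketch of what I mean (the label names below are placeholders; use whatever withLabel names actually appear in the config):

process {
    // placeholder labels: replace with the actual labels used in the pipeline config
    withLabel: smallTask { cpus = 1 }
    withLabel: bigTask   { cpus = 1 }
}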

What you can also try: do not start the pipeline from a login node. Start an interactive session on a compute node and then execute the pipeline like so (I think):

srun -N 1 --ntasks-per-node=2 --pty bash

@stefanches7 (Author)

@hoelzer I think I would try the first way. Where can I find the locations of the tools' Docker containers?

hoelzer (Collaborator) commented Apr 9, 2020

okay, here you can see all docker images the pipeline is using:
https://github.com/replikation/What_the_Phage/blob/master/configs/local.config

If you are able to generate and store Singularity images from these Docker images via singularity pull ..., you should be fine.
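
A rough sketch of how such a pre-pull could look (the two image names are just the ones mentioned in this thread, so take the full list from local.config; the cache path is whatever you pass via --cachedir):

#!/bin/bash
# pre-pull the Docker images as Singularity .img files into the cache directory
CACHE=/s/project/phagehost/c            # the directory passed via --cachedir
cd "$CACHE" || exit 1
# only two example images listed here; add the rest from configs/local.config
for img in nanozoo/basics:1.0--962b907 multifractal/deepvirfinder:0.1; do
    name="$(echo "$img" | tr '/:' '-').img"   # mirrors the <owner>-<tool>-<tag>.img naming used by Nextflow
    [ -f "$name" ] || singularity pull --name "$name" "docker://$img"
done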
