latest available lustre-utils not downloading #152

Open

gnvalbuena opened this issue Jul 12, 2016 · 8 comments

@gnvalbuena

Hi,

I was trying to set up a Lustre filesystem for an analysis on AWS, but the mount step was failing with a fatal error and aborting. The relevant error message I get is as follows:

TASK: [download latest available lustre-utils] ********************************
failed: [frontend001] => {"dest": "/root/lustre-utils_1.8.5+dfsg-3.1ubuntu2_amd64.deb", "failed": true, "response": "HTTP Error 404: Not Found", "state": "absent", "status_code": 404, "url": "http://archive.ubuntu.com/ubuntu/pool/universe/l/lustre/lustre-utils_1.8.5+dfsg-3.1ubuntu2_amd64.deb"}
msg: Request failed
failed: [compute001] => {"dest": "/root/lustre-utils_1.8.5+dfsg-3.1ubuntu2_amd64.deb", "failed": true, "response": "HTTP Error 404: Not Found", "state": "absent", "status_code": 404, "url": "http://archive.ubuntu.com/ubuntu/pool/universe/l/lustre/lustre-utils_1.8.5+dfsg-3.1ubuntu2_amd64.deb"}
msg: Request failed
failed: [compute002] => {"dest": "/root/lustre-utils_1.8.5+dfsg-3.1ubuntu2_amd64.deb", "failed": true, "response": "HTTP Error 404: Not Found", "state": "absent", "status_code": 404, "url": "http://archive.ubuntu.com/ubuntu/pool/universe/l/lustre/lustre-utils_1.8.5+dfsg-3.1ubuntu2_amd64.deb"}
msg: Request failed

FATAL: all hosts have already failed -- aborting
Failures: compute002 (1 failures), compute001 (1 failures), frontend001 (1 failures)

I had a quick look at the URL specified, and the archive directory seems to contain "lustre-utils_1.8.5+dfsg-3ubuntu1_amd64.deb" rather than "lustre-utils_1.8.5+dfsg-3.1ubuntu2_amd64.deb".
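
For reference, here's a quick way to list which lustre-utils packages the archive pool actually hosts at any given time (just a sketch; the available versions change as Ubuntu updates the pool):

# List the lustre-utils .deb files currently in the Ubuntu archive pool,
# to compare against the filename hard-coded in the Ansible download task.
import re
import urllib.request

POOL_URL = "http://archive.ubuntu.com/ubuntu/pool/universe/l/lustre/"

with urllib.request.urlopen(POOL_URL) as response:
    listing = response.read().decode("utf-8", errors="replace")

for deb in sorted(set(re.findall(r'lustre-utils_[^">]+_amd64\.deb', listing))):
    print(deb)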

Best,
Gabriel

@chapmanb
Member

Gabriel;
Apologies, it looks like the Lustre filesystem Ansible scripts are a bit out of date and will need a refresh. In the short term, you can use the encrypted NFS shared filesystem installed as part of the setup:

http://bcbio-nextgen.readthedocs.io/en/latest/contents/cloud.html#running-a-cluster

Sorry not to have an immediate Lustre fix, but I hope this works for your needs.

@gnvalbuena
Author

Hi Brad,

Yes, I switched to the NFS system as soon as the Lustre filesystem failed on me, so I've got my analysis running anyway. I just wanted to flag the fault in case someone else runs into it.

I was also wondering how much space bcbio needs in the shared filesystem while running. I'm running my analysis on a few hundred samples, so I was concerned about how much space I needed to have on hand. At the moment I've provisioned a TB of NFS space, as I didn't want the cluster to run out of space mid-analysis, but I suspect that was probably too much? I'd just like to know in the interest of keeping AWS costs down for future runs.

Also, an unrelated question: is there a way to know whether my analysis on a cluster is still running OK and what stage it's at? On a single machine I used to just rely on the log, but I'm not quite sure what to do on SLURM. squeue and sacct_std tell me the processes are still running, but is there a way to check what step of the process it has reached?

Thanks!

@chapmanb
Member

Glad the NFS option worked for you. We normally estimate ~3x the size of the bgzipped input fastq files for pipelines not involving recalibration/realignment (to hold the original fastqs, BAMs and associated files) and 4-5x for those that include recalibration and realignment.
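
As a rough back-of-the-envelope sketch of that rule of thumb (the sample count and per-sample fastq size below are made-up numbers, not measurements):

# Rough work-space estimate from the multipliers above.
# Sample count and fastq sizes are illustrative placeholders.
def estimate_work_space_gb(total_fastq_gb, with_recal_realign=False):
    """Approximate shared filesystem space needed, in GB."""
    multiplier = 5 if with_recal_realign else 3
    return total_fastq_gb * multiplier

total_input_gb = 300 * 0.5  # e.g. 300 samples at ~0.5 GB of bgzipped fastq each
print(estimate_work_space_gb(total_input_gb))                           # ~450 GB
print(estimate_work_space_gb(total_input_gb, with_recal_realign=True))  # ~750 GB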

You should be able to track the progress of the run in your slurm* output file, as well as in log/bcbio-nextgen.log for a high-level view and log/bcbio-nextgen-debug.log for a detailed view. Hope this helps.
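
If you want a quick way to check from the frontend node that the logs are still being written to, something like this works (just a sketch; it assumes you run it from the work directory with the default log/ layout):

# Print how recently each bcbio log was written and its final line,
# as a quick "is anything still happening?" check.
import glob
import os
import time

logs = sorted(glob.glob("slurm-*.out")) + ["log/bcbio-nextgen.log",
                                           "log/bcbio-nextgen-debug.log"]
for path in logs:
    if not os.path.exists(path):
        print("%s: not found" % path)
        continue
    age_min = (time.time() - os.path.getmtime(path)) / 60.0
    with open(path) as handle:
        lines = handle.readlines()
    last = lines[-1].strip() if lines else "(empty)"
    print("%s: last written %.0f minutes ago" % (path, age_min))
    print("    %s" % last)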

@gnvalbuena
Author

I've had a look at the slurm output file and both log files, and there doesn't seem to have been any progress in the logs past the fourth hour of the run. It's now 1.5 days since I started, and I'm not entirely sure whether the analysis is progressing or whether something has stalled.

My slurm-2.out ends with

[2016-07-13T02:24Z] compute002: Preparing 3001731548
[2016-07-13T02:24Z] compute001: Preparing 3001731547
[2016-07-13T02:24Z] compute002: Running in docker container: 1dc087e344a0763604f66f868096d76f4fe32db1c6842670a643df31baf3f71e

as its last few entries. My bcbio-nextgen.log file ends with

[2016-07-13T01:55Z] compute002: Downloading hg19 samtools from AWS
[2016-07-13T01:55Z] compute002: Testing minimum versions of installed programs
[2016-07-13T01:55Z] compute001: ipython: prepare_sample

as its last few entries, and the bcbio-nextgen-debug.log file ends with

[2016-07-13T02:24Z] compute002: Preparing 3001731548
[2016-07-13T02:24Z] compute001: Preparing 3001731547
[2016-07-13T02:24Z] compute002: Running in docker container: 1dc087e344a0763604f66f868096d76f4fe32db1c6842670a643df31baf3f71e

as its last few entries. There have been no new files added to my work folder, although sacct_std still lists the same running processes as when I first started checking.

This is probably just down to my inexperience with SLURM jobs, but does this indicate that my run is still working fine, or that something has gone wrong while the cluster keeps going? What is the expected sequence of outputs in the log?

My main concern is that, since no new files are being written to the work directory, nothing is actually happening with my analysis. It's a bit worrisome, as I've spun up a fairly sizeable cluster to deal with the large number of data files I'm analysing, and I'd like to know whether I need to leave it running, or whether to terminate the cluster if it's no longer actually making progress on the analysis.

@chapmanb
Member

Sorry about the issue -- it does look like something is wrong with the run, but it's hard to diagnose what is happening from what you have so far. It should have provided some kind of error or something else helpful, but since that didn't happen my suggestion is to:

  • Shut down the big cluster so you avoid wasting resources.
  • Restart a single-node cluster.
  • Rerun on a single machine with bcbio_vm.py run to see if you can reproduce the error and get more logging information about what is going wrong.

It's pretty early in the process, so hopefully you can identify the problem quickly on the re-run.

Apologies about the issues. We're actively working on updating our AWS runs to use Common Workflow Language (http://bcbio-nextgen.readthedocs.io/en/latest/contents/cwl.html) so we can run with tools like Toil on AWS (http://toil.readthedocs.io/en/latest/) to improve debugging and resource usage.

Hope this helps.

@gnvalbuena
Author

I tried it again on a single-node cluster, and the last error that came up related to mirdeep2 (I'm running a small RNA analysis, which I realize I hadn't mentioned until now).

The error it throws is:

#excising precursors
started: 3:55:23
/usr/local/share/bcbio-nextgen/anaconda/lib/mirdeep2//excise_precursors_iterati$
#excising precursors
1       121480
2       121480
...
49      121480
No file mirdeep_runs/run_res/tmp/precursors.fa_stack found
' returned non-zero exit status 2

There's another error earlier in the run that refers to the same "excising precursors" step, but I wasn't able to copy it over in full. I haven't quite figured out how to copy the log folder out to my local computer or to S3 (see the sketch at the end of this comment for one option), so the error messages I was able to copy are truncated. Sorry.

[2016-07-15T03:55Z] #excising precursors
[2016-07-15T03:55Z] started: 3:55:23
[2016-07-15T03:55Z] /usr/local/share/bcbio-nextgen/anaconda/lib/mirdeep2//excise_prec$
[2016-07-15T03:55Z] #excising precursors
[2016-07-15T03:56Z] 1   121480
[2016-07-15T03:56Z] 47  121480
[2016-07-15T03:56Z] Uncaught exception occurred
Traceback (most recent call last):
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pro$
    _do_run(cmd, checks, log_stdout)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pro$
    raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; unset PERL5LIB && export PATH=/usr/loca$
creating pdf for hsa-mir-664b finished

I've dealt with it in the meantime by just stripping mirdeep2 out of my analysis, and am waiting to see whether any errors come up in the remaining seqcluster and trna sections of the analysis.
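
For reference, something like this minimal boto3 sketch should copy the log directory up to S3 (the bucket name and key prefix are placeholders, not real values; aws s3 cp --recursive from the AWS CLI would do the same job):

# Upload the bcbio log/ directory to S3 so the full logs can be shared.
# Bucket name and key prefix are placeholders; adjust for your own account.
import os
import boto3

BUCKET = "my-bcbio-logs"    # placeholder bucket name
PREFIX = "run-2016-07/log"  # placeholder key prefix
LOG_DIR = "log"

s3 = boto3.client("s3")
for name in os.listdir(LOG_DIR):
    local_path = os.path.join(LOG_DIR, name)
    if os.path.isfile(local_path):
        s3.upload_file(local_path, BUCKET, "%s/%s" % (PREFIX, name))
        print("uploaded %s" % name)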

@lpantano
Collaborator

Hi,

Sorry about these errors. I don't see the full command in the last chunk of code you posted, so I can't tell whether it is mirdeep2 or not. The first one should be OK, in the sense that the run should continue even if you see that in the log. Any chance you can get the full *log files and send them to me, or post them here?

I will try to replicate that with separate data and see if I can get more information. Let me know if skipping mirdeep2 helps in this case.

cheers

@gnvalbuena
Author

gnvalbuena commented Jul 15, 2016

Hi,

Sorry, I hadn't put those in sequence, as I wasn't able to copy the truncated error message properly. The error I showed second actually came first in the log. It came up in between a set of "creating pdf for hsa-mir-xxx finished" messages, and it did continue on until the "No file mirdeep_runs/run_res/tmp/precursors.fa_stack found" error.

The first error I showed in the previous message was the very last item on the log. At that point, there were no further additions to the log, even after several hours.

I can try to recreate the error later, once my current analysis run finishes, if that would help. I just figured out how to copy files out of the cluster, so hopefully I can get a full log out if I can recreate the problem.
