Failing Job 'run lastz' - Running Toil on AWS #4773
Comments
It looks like we should probably be logging that as the value defined in `toil/src/toil/batchSystems/abstractBatchSystem.py`, lines 53 to 54 (at commit 73f155e).
When jobs are "lost", that is supposed to mean that the machine that was supposed to be running them itself vanished. Usually on AWS this will be because it is a spot market machine, which Amazon is allowed to take back if someone outbids you for it. It sounds like you set up a cluster with AWS spot instances, and the Cactus workflow says that the lastz jobs are OK to get preempted and can run on spot nodes. But your spot nodes are not sticking around long enough to actually finish the lastz jobs; either the spot price is rising higher than your spot bid, or Amazon is taking them away for other reasons. Maybe try raising your spot bid, or else not using spot instances and paying for all reserved instances?
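(For illustration only — this is not Cactus's actual code, and the function name and resource values below are invented — here is a minimal Toil sketch of what "OK to get preempted" means at the job level. Older Toil releases spell the keyword `preemptable` rather than `preemptible`.)

```python
from toil.common import Toil
from toil.job import Job


def align_chunk(job):
    # Invented stand-in for work like a lastz alignment chunk.
    job.log("running an alignment chunk")


if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("./jobstore")
    options.logLevel = "INFO"
    with Toil(options) as workflow:
        # preemptible=True tells Toil the job may run on spot nodes; if the
        # node is reclaimed, the job is reported lost and reissued elsewhere.
        root = Job.wrapJobFn(align_chunk, memory="2G", cores=1, disk="2G",
                             preemptible=True)
        workflow.start(root)
```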
When you say "workflow is 100%", do you mean that the progress bar Toil shows hit 100%, or that the Toil process reported that the workflow completed successfully and then returned a 0 exit code? The progress bar is not actually very good; it only knows about the jobs that have already been created in the workflow and doesn't predict what jobs will be created by those jobs in the future. So for a lot of workflows it will reach 100% because all the jobs it knows about have run, and then go back down when it looks and sees new jobs have been added to do.
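(As a minimal illustration of the dynamic job creation described above — this sketch is not from Cactus, and the function names are invented — the children below do not exist until the parent runs, which is why a progress estimate based only on already-known jobs can hit 100% and then fall back.)

```python
from toil.common import Toil
from toil.job import Job


def parent(job):
    # When the workflow starts, only this job is known, so a progress bar
    # based on known jobs can report 100% once it finishes...
    for i in range(5):
        job.addChildJobFn(child, i, memory="1G", cores=1, disk="1G")
    # ...but these five children are only discovered here, so the total job
    # count grows and the reported progress can drop back down.


def child(job, i):
    job.log(f"child {i} running")


if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("./jobstore")
    with Toil(options) as workflow:
        workflow.start(Job.wrapJobFn(parent))
```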
Following up on an email from @tinaveit. The command line is apparently:
This means that spot nodes are not being used (as the
Absolutely! Here is the log file:
This log looks fine. It's normal for lastz jobs to get swallowed up like that (I don't know why), but they were successfully rescheduled. At the end of your log, it looked like it was running along fine in the "bar" stage. A few things that did look suspicious:
tl;dr: it seems to be running fine -- if it seems too slow, give it more cores with
Okay, sounds good. Thanks so much for your input. I will change the tree and add some threads. Any suggestions for what would be an adequate number of `--consCores`? We had to add that argument because it was giving us errors in the beginning.
@glennhickey I ran today with the updated consCores value and changed the tree. The cactus run crashed due to a failed job, and exited the instance. I have attached the end of the log file, where it shows the error and the failures. Everything above that looked normal. Not sure what could be going wrong now. Thanks for your time!
This one looks like a bug
I added these environment variables in the most recent version of Cactus (they are set by Cactus). They work fine locally and on Slurm, but if I'm reading this right they are somehow getting lost on AWS (which is not part of our automated tests). Thanks for raising this -- I will try to reproduce it here and find a solution.
I can reproduce this. @adamnovak Just to confirm: on Slurm, environment variables get passed to worker nodes, but on AWS/Mesos this is not the case. Is there a rationale behind this? Could/should Toil be consistent and pass through environment variables on Mesos?
We're meant to snapshot the environment when the workflow starts up, and then ship that off to the worker to restore when running jobs (see `toil/src/toil/options/common.py`, lines 680 to 682 at commit 0446fe2).
We capture it at line 898 (commit 0446fe2).
And we restore it here in the worker (lines 192 to 228, commit 0446fe2).
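(Conceptually, the snapshot/restore scheme works roughly like the simplified sketch below. This is not Toil's actual implementation; the file name and the `keep` list are invented for illustration.)

```python
import os
import pickle


def snapshot_environment(path="environment.pickle"):
    # Leader side: capture the leader's environment when the workflow starts.
    with open(path, "wb") as f:
        pickle.dump(dict(os.environ), f)


def restore_environment(path="environment.pickle", keep=("PATH", "HOME")):
    # Worker side: restore the snapshot before running a job, without
    # clobbering variables the worker must keep for itself.
    with open(path, "rb") as f:
        saved = pickle.load(f)
    for name, value in saved.items():
        if name not in keep:
            os.environ[name] = value


if __name__ == "__main__":
    snapshot_environment()   # done once on the leader
    restore_environment()    # done on the worker before running a job
```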
Now, Slurm's
Thanks @adamnovak. This does seem to be a Toil bug then. It reproduces quite quickly with
which fails with the same error reported above. It happens because it can't find the
@tinaveit This is a bug in Cactus, not Toil, as @adamnovak pointed out. The good news is that I think you should be able to resume your workflow by running with
@tinaveit I think we have sorted out all the problems reported here, so I'm going to close this. If you hit another problem in the same workflow, can you open a new issue and reference this one?
Hi,
We are performing genome alignments using the Progressive Cactus software and Toil on Amazon Web Services. We don't have much computational background, and we are getting the following error (several times, across different runs):
No log file is present, despite job failing: 'run_lastz' 41ff109d-df4c-45eb-897b-a7c2c9eae970 v1
Due to failure we are reducing the remaining try count of job 'run_lastz' 41ff109d-df4c-45eb-897b-a7c2c9eae970 v1 with ID 41ff109d-df4c-45eb-897b-a7c2c9eae970 to 5
We have increased the default memory of the failed job 'run_lastz' 41ff109d-df4c-45eb-897b-a7c2c9eae970 v1 to 2147483648 bytes
We have increased the disk of the failed job 'run_lastz' 41ff109d-df4c-45eb-897b-a7c2c9eae970 v1 to the default of 2147483648 bytes
Issued job 'run_lastz' 41ff109d-df4c-45eb-897b-a7c2c9eae970 v2 with job batch system ID: 17361 and disk: 2.0 Gi, memory: 2.0 Gi, cores: 1, accelerators: [], preemptible: False
Job failed with exit value 255: 'run_lastz' 257e6b07-2875-4f0f-b233-166bc7d02364 v1
Exit reason: 3
Then it mentions that several jobs were lost.
It doesn't seem to crash the Cactus run, but once the workflow reaches 100%, there is no output file. We haven't deleted any files from the jobstore on AWS, and we are at a loss as to what this error could be related to. I also have the full logfile downloaded locally in case it may be helpful. Any help/ideas would be greatly appreciated, especially because each AWS run is costly. Thank you for your time!
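(For context on the log messages above — a hedged sketch, not Cactus code; the function name and values are invented: the 2147483648-byte figures are the 2 GiB per-job defaults the log itself refers to, and the "remaining try count" comes from Toil's retry logic. At the Toil level, per-job resource requests and the retry count look roughly like this.)

```python
from toil.common import Toil
from toil.job import Job


def lastz_like_step(job):
    # Invented placeholder for a step comparable to Cactus's run_lastz.
    job.log("aligning")


if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("./jobstore")
    options.retryCount = 5  # how many times a failed job is retried before giving up
    with Toil(options) as workflow:
        # Asking for 4 GiB up front means the job starts above the 2 GiB
        # (2147483648-byte) default mentioned in the log messages.
        root = Job.wrapJobFn(lastz_like_step, memory="4G", disk="4G", cores=1)
        workflow.start(root)
```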