Failing Job 'run lastz' - Running Toil on AWS #4773

Closed
tinaveit opened this issue Jan 29, 2024 · 14 comments

@tinaveit

tinaveit commented Jan 29, 2024

Hi,
We are performing genome alignments using Progressive Cactus software and toil on Amazon Web Services.We don't have much computational background, and are getting the following error (several times for different runs):

No log file is present, despite job failing: 'run_lastz' 41ff109d-df4c-45eb-897b-a7c2c9eae970 v1
Due to failure we are reducing the remaining try count of job 'run_lastz' 41ff109d-df4c-45eb-897b-a7c2c9eae970 v1 with ID 41ff109d-df4c-45eb-897b-a7c2c9eae970 to 5
We have increased the default memory of the failed job 'run_lastz' 41ff109d-df4c-45eb-897b-a7c2c9eae970 v1 to 2147483648 bytes
We have increased the disk of the failed job 'run_lastz' 41ff109d-df4c-45eb-897b-a7c2c9eae970 v1 to the default of 2147483648 bytes
Issued job 'run_lastz' 41ff109d-df4c-45eb-897b-a7c2c9eae970 v2 with job batch system ID: 17361 and disk: 2.0 Gi, memory: 2.0 Gi, cores: 1, accelerators: [], preemptible: False
Job failed with exit value 255: 'run_lastz' 257e6b07-2875-4f0f-b233-166bc7d02364 v1
Exit reason: 3

Then it mentions that several jobs were lost.

It doesn't seem to crash the Cactus run, but once the workflow is at 100%, there is no output file. We haven't deleted any files from the jobstore on AWS, and we are at a loss as to what this error could be related to. I also have the full log file downloaded locally in case it is helpful. Any help or ideas would be greatly appreciated, especially because each AWS run is costly. Thank you for your time!

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1488

@adamnovak
Member

It looks like we should probably be logging that as Exit reason: LOST instead of Exit reason: 3:

LOST: int = 3
"""Preemptable failure (job's executing host went away)."""

When jobs are "lost", that is supposed to mean that the machine that was supposed to be running them itself vanished. On AWS this will usually be because it is a spot market machine, which Amazon is allowed to take back if someone outbids you for it.
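For what it's worth, mapping the number back to a name is just a lookup on that enum; here is a minimal sketch (assuming an IntEnum along the lines of the snippet above, not necessarily Toil's full definition):

from enum import IntEnum

class BatchJobExitReason(IntEnum):
    # Only the member relevant here; the real enum has more values.
    LOST = 3  # Preemptable failure (job's executing host went away).

raw_exit_reason = 3
# Logging the name instead of the bare integer makes the message self-explanatory.
print(f"Exit reason: {BatchJobExitReason(raw_exit_reason).name}")  # Exit reason: LOST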

It sounds like you set up a cluster with AWS spot instances, and the Cactus workflow says that the lastz jobs are OK to get preempted and can run on spot nodes. But, your spot nodes are not sticking around long enough to actually finish the lastz jobs; either the spot price is rising higher than your spot bid, or Amazon is taking them away for other reasons.

Maybe try raising your spot bid, or else not using spot instances and paying for all reserved instances?

@adamnovak
Member

adamnovak commented Jan 29, 2024

When you say "workflow is 100%", do you mean that the progress bar Toil shows hit 100%, or that the Toil process reported that the workflow completed successfully and then returned a 0 exit code? The progress bar is not actually very good; it only knows about the jobs that have already been created in the workflow and doesn't predict what jobs will be created by those jobs in the future. So for a lot of workflows it will reach 100% because all the jobs it knows about have run, and then go back down when it looks and sees that new jobs have been added that still need to run.
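To make that concrete, here is a toy illustration (not Toil's actual accounting) of why the bar can hit 100% and then drop:

# Toy numbers only: the percentage is computed against the jobs the leader
# knows about so far, and finished jobs can issue new child jobs.
known_jobs, finished_jobs = 10, 10
print(f"{100 * finished_jobs // known_jobs}%")  # 100%: everything known so far has run

known_jobs += 5  # a finished job issues 5 follow-on jobs
print(f"{100 * finished_jobs // known_jobs}%")  # 66%: the bar goes back down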

@glennhickey
Contributor

Following up on an email from @tinaveit. The command line is apparently:

cactus --consCores 2 --nodeTypes c4.8xlarge,r4.8xlarge --minNodes 0,0 --maxNodes 20,1 --nodeStorage 250 --provisioner aws --batchSystem mesos --metrics aws:us-east-1:xtremo-devo-cactus --logFile cactus.log seqFile.txt output.hal

This means that spot nodes are not being used (as the --nodeTypes arguments don't have :<bid> suffixes). I'm not sure what's happening; are you able to post the whole log?

@tinaveit
Author

tinaveit commented Mar 1, 2024

Absolutely! Here is the log file:
cactus.log
Also, to answer @adamnovak: it was the progress bar that reached 100%, but the exit code was not 0, so that matches what you said. Thanks for your support!

@glennhickey
Contributor

This log looks fine. It's normal for lastz jobs to get swallowed up like that (I don't know why) but they were successfully rescheduled. At the end of your log, it looked like it was running along fine in the "bar" stage. A few things that did look suspicious:

  • you specified --consCores 2. That means the cactus job (which is running at the end of your log) only gets 2 threads. So something I'd expect to take about 2 hours for 700 Mb genomes on 64 cores may take 30x longer.
  • your tree (Oryzias_latipes:0.121261,Nematolebias_whitei:0.141531)0.471387; gives the root node the name 0.471387. This is supposed to trigger an error message in Cactus. It didn't, so it will probably run through, but you want to avoid this in general. (Note that distances come after : characters, but there is no node above the root, so your input tree should just be (Oryzias_latipes:0.121261,Nematolebias_whitei:0.141531);.) There is a small sketch for spotting this at the end of this comment.

tldr: it seems to be running fine -- if it seems too slow give it more cores with --consCores.
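As a quick way to spot the root-label situation mentioned above, here is a minimal sketch (plain string handling, no Newick parser assumed; the root_label helper is hypothetical):

def root_label(newick: str) -> str:
    # Whatever sits between the last ')' and the trailing ';' becomes the
    # root node's label (here, the leftover branch length 0.471387).
    newick = newick.strip()
    assert newick.endswith(";"), "expected a Newick string ending in ';'"
    return newick[newick.rfind(")") + 1:-1]

tree = "(Oryzias_latipes:0.121261,Nematolebias_whitei:0.141531)0.471387;"
label = root_label(tree)
if label:
    fixed = tree[:tree.rfind(")") + 1] + ";"
    print(f"Root would be named {label!r}; consider using: {fixed}")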

@tinaveit
Author

tinaveit commented Mar 1, 2024

Okay, sounds good. Thanks so much for your input. I will change the tree and add some threads. Any suggestions for what would be an adequate number for --consCores? We had to add that argument because it was giving us errors in the beginning.

@glennhickey
Contributor

--consCores should be equal to the number of cores on the AWS instance you are running on. I guess 32 here?

@tinaveit
Author

@glennhickey I ran today with the updated consCores value and changed the tree. The cactus run crashed due to a failed job, and exited the instance. I have attached the end of the log file, where it shows the error and the failures. Everything above that looked normal. Not sure what could be going wrong now. Thanks for your time!
cactus_end.log

@glennhickey
Contributor

This one looks like a bug:

            return max(min(int(os.environ['CACTUS_MAX_MEMORY']), memory_bytes), int(os.environ['CACTUS_DEFAULT_MEMORY']))
        ValueError: invalid literal for int() with base 10: 'None'

I added these environment variables in the most recent version of Cactus (they are set by Cactus itself). They work fine locally and on Slurm, but if I'm reading this right, they are somehow getting lost on AWS (which is not part of our automated tests).
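For illustration, a more defensive lookup would sidestep the crash; a minimal sketch (the env_bytes helper and fallback values are hypothetical, not Cactus's actual code), treating a missing variable or the literal string 'None' as unset:

import os

def env_bytes(name: str, fallback: int) -> int:
    # Fall back instead of crashing when the variable is absent or was
    # serialised as the string 'None' (the failure mode in the traceback above).
    value = os.environ.get(name)
    if value is None or value == 'None':
        return fallback
    return int(value)

max_memory = env_bytes('CACTUS_MAX_MEMORY', 256 * 1024**3)        # hypothetical 256 GiB cap
default_memory = env_bytes('CACTUS_DEFAULT_MEMORY', 2 * 1024**3)  # hypothetical 2 GiB default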

Thanks for raising this -- I will try to reproduce it here and find a solution.

@glennhickey
Contributor

I can reproduce this. @adamnovak Just to confirm: on Slurm, environment variables get passed to worker nodes, but on AWS/Mesos this is not the case. Is there a rationale behind this? Could/should Toil be consistent and pass through environment variables on Mesos?

@adamnovak
Member

We're meant to snapshot the environment when the workflow starts up, and then ship that off to the worker to restore when running jobs:

"be looked up in the current environment. Independently of this option, the worker "
"will try to emulate the leader's environment before running a job, except for "
"some variables known to vary across systems. Using this option, a variable can "

We capture os.environ inside start() here, and save to environment.pickle in the job store:

self._serialiseEnv()

And we restore it here in the worker:

toil/src/toil/worker.py

Lines 192 to 228 in 0446fe2

# First load the environment for the job.
with jobStore.read_shared_file_stream("environment.pickle") as fileHandle:
    environment = safeUnpickleFromStream(fileHandle)
env_reject = {
    "TMPDIR",
    "TMP",
    "HOSTNAME",
    "HOSTTYPE",
    "HOME",
    "LOGNAME",
    "USER",
    "DISPLAY",
    "JAVA_HOME",
    "XDG_SESSION_TYPE",
    "XDG_SESSION_CLASS",
    "XDG_SESSION_ID",
    "XDG_RUNTIME_DIR",
    "XDG_DATA_DIRS",
    "DBUS_SESSION_BUS_ADDRESS"
}
for i in environment:
    if i == "PATH":
        # Handle path specially. Sometimes e.g. leader may not include
        # /bin, but the Toil appliance needs it.
        if i in os.environ and os.environ[i] != '':
            # Use the provided PATH and then the local system's PATH
            os.environ[i] = environment[i] + ':' + os.environ[i]
        else:
            # Use the provided PATH only
            os.environ[i] = environment[i]
    elif i not in env_reject:
        os.environ[i] = environment[i]
# sys.path is used by __import__ to find modules
if "PYTHONPATH" in environment:
    for e in environment["PYTHONPATH"].split(':'):
        if e != '':
            sys.path.append(e)

Now, Slurm's sbatch will additionally do its own environment capture and restore, and single-machine Toil also will end up inheriting the leader's environment for free anyway, so it's possible this system is somehow broken and we haven't noticed.
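As a standalone illustration of that snapshot-and-restore idea (a toy sketch only; Toil's real implementation is the worker.py excerpt above):

import os
import pickle

# Leader side: snapshot the environment (Toil writes "environment.pickle"
# into the job store when the workflow starts).
with open("environment.pickle", "wb") as handle:
    pickle.dump(dict(os.environ), handle)

# Worker side: restore it before running a job, skipping variables that are
# expected to differ between machines.
machine_specific = {"TMPDIR", "TMP", "HOSTNAME", "HOME"}
with open("environment.pickle", "rb") as handle:
    saved = pickle.load(handle)
for name, value in saved.items():
    if name not in machine_specific:
        os.environ[name] = value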

@glennhickey
Contributor

Thanks @adamnovak. This does seem to be a Toil bug then. It reproduces quite quickly with:

virtualenv --system-site-packages cactus-venv
. cactus-venv/bin/activate
wget https://github.com/ComparativeGenomicsToolkit/cactus/releases/download/v2.7.2/cactus-bin-v2.7.2.tar.gz
tar zxf cactus-bin-v2.7.2.tar.gz
cd cactus-bin-v2.7.2
pip install -U .
cactus aws:us-west-2:glennhickey-jobstore11 ./examples/evolverMammals.txt em.hal --provisioner aws --batchSystem mesos --nodeTypes c4.8xlarge --skipPreprocessor --consCores 4

which fails with the same error reported above. It happens because it can't find the CACTUS_MAX_MEMORY and CACTUS_DEFAULT_MEMORY environment variables, which Cactus sets itself before start(). I also tried exporting them on the command line first and got the same result.

@glennhickey
Contributor

@tinaveit This is a bug in Cactus, not Toil, as @adamnovak pointed out.

The good news is that I think you should be able to resume your workflow by running with --restart --defaultMemory 2000000000 --maxMemory 256000000000. Thanks for raising this, I'll make sure it's fixed in the next Cactus release (which should be pretty soon).

@adamnovak
Member

@tinaveit I think we have sorted out all the problems reported here, so I'm going to close this. If you hit another problem in the same workflow, can you open a new issue and reference this one?
