Failing Job 'run lastz' - Running Toil on AWS #4773

Closed
tinaveit opened this issue Jan 29, 2024 · 14 comments

@tinaveit

tinaveit commented Jan 29, 2024

Hi,
We are performing genome alignments using Progressive Cactus software and toil on Amazon Web Services.We don't have much computational background, and are getting the following error (several times for different runs):

No log file is present, despite job failing: 'run_lastz' 41ff109d-df4c-45eb-897b-a7c2c9eae970 v1
Due to failure we are reducing the remaining try count of job 'run_lastz' 41ff109d-df4c-45eb-897b-a7c2c9eae970 v1 with ID 41ff109d-df4c-45eb-897b-a7c2c9eae970 to 5
We have increased the default memory of the failed job 'run_lastz' 41ff109d-df4c-45eb-897b-a7c2c9eae970 v1 to 2147483648 bytes
We have increased the disk of the failed job 'run_lastz' 41ff109d-df4c-45eb-897b-a7c2c9eae970 v1 to the default of 2147483648 bytes
Issued job 'run_lastz' 41ff109d-df4c-45eb-897b-a7c2c9eae970 v2 with job batch system ID: 17361 and disk: 2.0 Gi, memory: 2.0 Gi, cores: 1, accelerators: [], preemptible: False
Job failed with exit value 255: 'run_lastz' 257e6b07-2875-4f0f-b233-166bc7d02364 v1
Exit reason: 3

Then it mentions that several jobs were lost.

It doesn't seem to crash the Cactus run, but once the workflow is at 100%, there is no output file. We haven't deleted any files from the jobstore on AWS, and we are at a loss as to what this error could be related to. I also have the full log file downloaded locally in case it is helpful. Any help or ideas would be greatly appreciated, especially because each AWS run is costly. Thank you for your time!

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1488

@adamnovak
Member

It looks like we should probably be logging that as Exit reason: LOST instead of Exit reason: 3:

LOST: int = 3
"""Preemptable failure (job's executing host went away)."""

When jobs are "lost", that is supposed to mean that the machine that was supposed to be running them itself vanished. On AWS this will usually be because it is a spot market machine, which Amazon is allowed to take back if someone outbids you for it.
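For what it's worth, mapping the number back to a name is just a lookup on that enum; here is a minimal sketch (assuming an IntEnum along the lines of the snippet above, not necessarily Toil's full definition):

from enum import IntEnum

class BatchJobExitReason(IntEnum):
    # Only the member relevant here; the real enum has more values.
    LOST = 3  # Preemptable failure (job's executing host went away).

raw_exit_reason = 3
# Logging the name instead of the bare integer makes the message self-explanatory.
print(f"Exit reason: {BatchJobExitReason(raw_exit_reason).name}")  # Exit reason: LOST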

It sounds like you set up a cluster with AWS spot instances, and the Cactus workflow says that the lastz jobs are OK to get preempted and can run on spot nodes. But, your spot nodes are not sticking around long enough to actually finish the lastz jobs; either the spot price is rising higher than your spot bid, or Amazon is taking them away for other reasons.

Maybe try raising your spot bid, or else not using spot instances and paying for all reserved instances?

@adamnovak
Member

adamnovak commented Jan 29, 2024

When you say "workflow is 100%", do you mean that the progress bar Toil shows hit 100%, or that the Toil process reported that the workflow completed successfully and then returned a 0 exit code? The progress bar is not actually very good; it only knows about the jobs that have already been created in the workflow and doesn't predict what jobs will be created by those jobs in the future. So for a lot of workflows it will reach 100% because all the jobs it knows about have run, and then go back down when it looks and sees that new jobs have been added that still need to run.
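To make that concrete, here is a toy illustration (not Toil's actual accounting) of why the bar can hit 100% and then drop:

# Toy numbers only: the percentage is computed against the jobs the leader
# knows about so far, and finished jobs can issue new child jobs.
known_jobs, finished_jobs = 10, 10
print(f"{100 * finished_jobs // known_jobs}%")  # 100%: everything known so far has run

known_jobs += 5  # a finished job issues 5 follow-on jobs
print(f"{100 * finished_jobs // known_jobs}%")  # 66%: the bar goes back down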

@glennhickey
Contributor

Following up on an email from @tinaveit. The command line is apparently:

cactus --consCores 2 --nodeTypes c4.8xlarge,r4.8xlarge --minNodes 0,0 --maxNodes 20,1 --nodeStorage 250 --provisioner aws --batchSystem mesos --metrics aws:us-east-1:xtremo-devo-cactus --logFile cactus.log seqFile.txt output.hal

This means that spot nodes are not being used (as the --nodeTypes arguments don't have :<bid> suffixes). I'm not sure what's happening; are you able to post the whole log?

@tinaveit
Author

tinaveit commented Mar 1, 2024

Absolutely! Here is the log file:
cactus.log
Also, to answer @adamnovak: it was the progress bar that reached 100%, but the exit code was not 0, so that matches what you said. Thanks for your support!

@glennhickey
Contributor

This log looks fine. It's normal for lastz jobs to get swallowed up like that (I don't know why) but they were successfully rescheduled. At the end of your log, it looked like it was running along fine in the "bar" stage. A few things that did look suspicious:

  • you specified --consCores 2. That means the cactus job (which is running at the end of your log) only gets 2 threads. So something I'd expect to take about 2 hours for 700 Mb genomes on 64 cores may take 30x longer.
  • your tree (Oryzias_latipes:0.121261,Nematolebias_whitei:0.141531)0.471387; gives the root node the name 0.471387. This is supposed to trigger an error message in Cactus. It didn't, so it will probably run through, but you want to avoid this in general. (Note that distances come after : characters, but there is no node above the root, so your input tree should just be (Oryzias_latipes:0.121261,Nematolebias_whitei:0.141531);.) There is a small sketch for spotting this at the end of this comment.

tldr: it seems to be running fine -- if it seems too slow give it more cores with --consCores.
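As a quick way to spot the root-label situation mentioned above, here is a minimal sketch (plain string handling, no Newick parser assumed; the root_label helper is hypothetical):

def root_label(newick: str) -> str:
    # Whatever sits between the last ')' and the trailing ';' becomes the
    # root node's label (here, the leftover branch length 0.471387).
    newick = newick.strip()
    assert newick.endswith(";"), "expected a Newick string ending in ';'"
    return newick[newick.rfind(")") + 1:-1]

tree = "(Oryzias_latipes:0.121261,Nematolebias_whitei:0.141531)0.471387;"
label = root_label(tree)
if label:
    fixed = tree[:tree.rfind(")") + 1] + ";"
    print(f"Root would be named {label!r}; consider using: {fixed}")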

@tinaveit
Author

tinaveit commented Mar 1, 2024

Okay, sounds good. Thanks so much for your input. I will change the tree and add some threads. Any suggestions for what would be an adequate number for --consCores? We had to add that argument because it was giving us errors in the beginning.

@glennhickey
Contributor

--consCores should be equal to the number of cores on the AWS instance you are running on. I guess 32 here?

@tinaveit
Author

@glennhickey I ran today with the updated consCores value and changed the tree. The cactus run crashed due to a failed job, and exited the instance. I have attached the end of the log file, where it shows the error and the failures. Everything above that looked normal. Not sure what could be going wrong now. Thanks for your time!
cactus_end.log

@glennhickey
Contributor

This one looks like a bug:

            return max(min(int(os.environ['CACTUS_MAX_MEMORY']), memory_bytes), int(os.environ['CACTUS_DEFAULT_MEMORY']))
        ValueError: invalid literal for int() with base 10: 'None'

I added these environment variables in the most recent version of Cactus (they are set by Cactus itself). They work fine locally and on Slurm, but if I'm reading this right, they are somehow getting lost on AWS (which is not part of our automated tests).
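For illustration, a more defensive lookup would sidestep the crash; a minimal sketch (the env_bytes helper and fallback values are hypothetical, not Cactus's actual code), treating a missing variable or the literal string 'None' as unset:

import os

def env_bytes(name: str, fallback: int) -> int:
    # Fall back instead of crashing when the variable is absent or was
    # serialised as the string 'None' (the failure mode in the traceback above).
    value = os.environ.get(name)
    if value is None or value == 'None':
        return fallback
    return int(value)

max_memory = env_bytes('CACTUS_MAX_MEMORY', 256 * 1024**3)        # hypothetical 256 GiB cap
default_memory = env_bytes('CACTUS_DEFAULT_MEMORY', 2 * 1024**3)  # hypothetical 2 GiB default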

Thanks for raising this -- I will try to reproduce it here and find a solution.

@glennhickey
Contributor

I can reproduce this. @adamnovak Just to confirm: on Slurm, environment variables get passed to worker nodes, but on AWS/Mesos this is not the case. Is there a rationale behind this? Could/should Toil be consistent and pass through environment variables on Mesos?

@adamnovak
Member

We're meant to snapshot the environment when the workflow starts up, and then ship that off to the worker to restore when running jobs:

"be looked up in the current environment. Independently of this option, the worker "
"will try to emulate the leader's environment before running a job, except for "
"some variables known to vary across systems. Using this option, a variable can "

We capture os.environ inside start() here, and save to environment.pickle in the job store:

self._serialiseEnv()

And we restore it here in the worker:

toil/src/toil/worker.py

Lines 192 to 228 in 0446fe2

# First load the environment for the job.
with jobStore.read_shared_file_stream("environment.pickle") as fileHandle:
    environment = safeUnpickleFromStream(fileHandle)
env_reject = {
    "TMPDIR",
    "TMP",
    "HOSTNAME",
    "HOSTTYPE",
    "HOME",
    "LOGNAME",
    "USER",
    "DISPLAY",
    "JAVA_HOME",
    "XDG_SESSION_TYPE",
    "XDG_SESSION_CLASS",
    "XDG_SESSION_ID",
    "XDG_RUNTIME_DIR",
    "XDG_DATA_DIRS",
    "DBUS_SESSION_BUS_ADDRESS"
}
for i in environment:
    if i == "PATH":
        # Handle path specially. Sometimes e.g. leader may not include
        # /bin, but the Toil appliance needs it.
        if i in os.environ and os.environ[i] != '':
            # Use the provided PATH and then the local system's PATH
            os.environ[i] = environment[i] + ':' + os.environ[i]
        else:
            # Use the provided PATH only
            os.environ[i] = environment[i]
    elif i not in env_reject:
        os.environ[i] = environment[i]
# sys.path is used by __import__ to find modules
if "PYTHONPATH" in environment:
    for e in environment["PYTHONPATH"].split(':'):
        if e != '':
            sys.path.append(e)

Now, Slurm's sbatch will additionally do its own environment capture and restore, and single-machine Toil also will end up inheriting the leader's environment for free anyway, so it's possible this system is somehow broken and we haven't noticed.
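As a standalone illustration of that snapshot-and-restore idea (a toy sketch only; Toil's real implementation is the worker.py excerpt above):

import os
import pickle

# Leader side: snapshot the environment (Toil writes "environment.pickle"
# into the job store when the workflow starts).
with open("environment.pickle", "wb") as handle:
    pickle.dump(dict(os.environ), handle)

# Worker side: restore it before running a job, skipping variables that are
# expected to differ between machines.
machine_specific = {"TMPDIR", "TMP", "HOSTNAME", "HOME"}
with open("environment.pickle", "rb") as handle:
    saved = pickle.load(handle)
for name, value in saved.items():
    if name not in machine_specific:
        os.environ[name] = value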

@glennhickey
Contributor

Thanks @adamnovak. This does seem to be a Toil bug then. It reproduces quite quickly with:

virtualenv --system-site-packages cactus-venv
. cactus-venv/bin/activate
wget https://github.com/ComparativeGenomicsToolkit/cactus/releases/download/v2.7.2/cactus-bin-v2.7.2.tar.gz
tar zxf cactus-bin-v2.7.2.tar.gz
cd cactus-bin-v2.7.2
pip install -U .
cactus aws:us-west-2:glennhickey-jobstore11 ./examples/evolverMammals.txt em.hal --provisioner aws --batchSystem mesos --nodeTypes c4.8xlarge --skipPreprocessor --consCores 4

which fails with the same error reported above. It happens because it can't find the CACTUS_MAX_MEMORY and CACTUS_DEFAULT_MEMORY environment variables, which Cactus sets itself before start(). I also tried exporting them on the command line first and got the same result.

@glennhickey
Contributor

@tinaveit This is a bug in Cactus, not Toil, as @adamnovak pointed out.

The good news is that I think you should be able to resume your workflow by running with --restart --defaultMemory 2000000000 --maxMemory 256000000000. Thanks for raising this, I'll make sure it's fixed in the next Cactus release (which should be pretty soon).

@adamnovak
Member

@tinaveit I think we have sorted out all the problems reported here, so I'm going to close this. If you hit another problem in the same workflow, can you open a new issue and reference this one?
