
Queue scan: handle return value different from zero #198

Merged: 15 commits merged into master from handle-bad-return-value on Aug 26, 2024

Conversation

@Icemole (Collaborator) commented Jul 17, 2024

Fix #175.

Whenever the return value of the queue scan command is different from zero, the job states are eventually set to UNKNOWN. Whenever a job is set to UNKNOWN and has no log file attached, it is automatically considered RUNNABLE and queued.

This is very dangerous, because there is usually a job already queued "naturally", that is, queued because a previous queue scan finished gracefully, found nothing, and set the job state to UNKNOWN, thus queuing the job normally.

This wrong behavior can repeat many times, until the first scheduled job eventually enters the RUNNING state and a log file is generated, which prevents any further duplicate submissions of the same job.

This PR fixes this behavior by making the queue scan wait a few seconds and retry after a failed scan, instead of directly setting the job state to UNKNOWN.

Comment on lines 319 to 322
if retval != 0:
    logging.warning(self._system_call_error_warn_msg(system_command))
    time.sleep(gs.WAIT_PERIOD_BETWEEN_CHECKS)
    continue
Contributor commented:

Not sure if this solves the problem... If there is some fundamental, non-recoverable problem with the queue, then this sends the function into an endless loop.

Maybe instead just return the old cache?

Suggested change:

-if retval != 0:
-    logging.warning(self._system_call_error_warn_msg(system_command))
-    time.sleep(gs.WAIT_PERIOD_BETWEEN_CHECKS)
-    continue
+if retval != 0:
+    return self._task_info_cache

In the worst case we will never update the cache and a running/finished Job will still be marked as pending.

Contributor commented:

IMHO I would log and wait, so that the normal cache update process can run

Contributor commented:

But we should definitely not use gs.WAIT_PERIOD_BETWEEN_CHECKS, as that is used for:

#: How often should the manager check for finished jobs
WAIT_PERIOD_BETWEEN_CHECKS = 30

Contributor commented:

define a new variable?

Collaborator (Author) commented:

I found gs.WAIT_PERIOD_QSTAT_PARSING. I assume "QSTAT" implies that sisyphus was originally intended to only run with SGE, and the functionality was extended afterwards. I would just use this variable and not create another, more generic one such as gs.WAIT_PERIOD_QUEUE_PARSING.
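For reference, a hedged sketch of how such a setting looks in sisyphus' global settings module (the comment and value here are illustrative, not necessarily the actual defaults):

#: How long (in seconds) to wait before retrying a failed or unparsable queue status query (illustrative value)
WAIT_PERIOD_QSTAT_PARSING = 15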

@albertz (Member) commented Jul 17, 2024

Is this related to #175? Or is there any issue which describes the problem?

@Icemole Icemole requested a review from curufinwe July 17, 2024 13:23
(Resolved review comment on sisyphus/load_sharing_facility_engine.py)

@michelwi (Contributor) commented:

Is this related to #175?

yes, possibly

@Icemole (Collaborator, Author) commented Jul 17, 2024

Is this related to #175?

Yes, definitely, this is what I've been observing for quite some time now, and I believe handling this here fixes the issue :)

@albertz (Member) commented Jul 18, 2024

Whenever the return value of the queue scan command is different from zero, the job states are eventually set as UNKNOWN. Whenever some job is set as UNKNOWN and there's no log file attached, the job will be automatically considered as RUNNABLE and queued.

What do you mean by "queue scan command"? You mean what happens in the queue_state function? Or in the submit_call function?

What do you mean by "log file attached"?

This is a very dangerous process because there's usually a job already queued "naturally", that is, queued because the queue scan job finished gracefully, found nothing, and set the job state as UNKNOWN, thus queuing the job normally.

What do you mean by "finished gracefully"? Above you said the return value is different from zero? So it's not finished gracefully then? Or is it?

What do you mean by "there's usually a job already queued "naturally", that is, queued because the queue scan job finished gracefully"? Why is the job already queued? If the command exists with non-zero code, usually the job is not queued, or not?

What do you mean by "queue scan job"? Before you said "queue scan command". Do you mean the same? Do you actually mean the queue_state function?

This wrong process can happen many times, until the first job scheduled eventually enters RUNNING state, and thus a log file is generated, preventing any further same job schedulings.

This PR fixes such wrong behavior by making the queue scan wait some seconds before rescheduling it again after a failure in the scan, and not directly setting the job state as UNKNOWN.

What do you mean by "scan"?

Sorry, I don't fully understand the actual problem. Aren't you saying there is a bug in queue_state? But then, why not fix this bug? It seems like instead of fixing the bug, you add a workaround at another place? But maybe I just didn't fully understand it.

@Icemole (Collaborator, Author) commented Jul 18, 2024

@albertz I'll answer you in line.

Whenever the return value of the queue scan command is different from zero, the job states are eventually set as UNKNOWN. Whenever some job is set as UNKNOWN and there's no log file attached, the job will be automatically considered as RUNNABLE and queued.

What do you mean by "queue scan command"? You mean what happens in the queue_state function? Or in the submit_call function?

Yes, I mean what happens in the queue_state function. More specifically, I mean running the actual squeue, qstat, etc. commands. If you have a standard nomenclature for this, please let me know to avoid any further confusion.

What do you mean by "log file attached"?

I mean the log file attached to the job/task pair. For instance, if we're running some job whose task is run, by "log file attached" I mean the log.run.1 file created on disk. You can also see this as the normal log file generated in the filesystem by starting the job/task pair.
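As an illustration only (not the exact sisyphus implementation), "log file attached" roughly amounts to a check like this:

import os

def task_started(job_dir, task_name, task_id):
    # Sketch: the job/task pair counts as "started" once its per-task log
    # file, e.g. log.run.1, exists in the job directory.
    return os.path.isfile(os.path.join(job_dir, "log.%s.%i" % (task_name, task_id)))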

What do you mean by "queue scan job"? Before you said "queue scan command". Do you mean the same? Do you actually mean the queue_state function?

Yes, I mean the same, apologies for the confusing nomenclature. I mean running squeue, qstat, etc.

This wrong process can happen many times, until the first job scheduled eventually enters RUNNING state, and thus a log file is generated, preventing any further same job schedulings.
This PR fixes such wrong behavior by making the queue scan wait some seconds before rescheduling it again after a failure in the scan, and not directly setting the job state as UNKNOWN.

What do you mean by "scan"?

Running sbatch, qstat, etc. I mean the "queue scan [command/job]".

This is a very dangerous process because there's usually a job already queued "naturally", that is, queued because the queue scan job finished gracefully, found nothing, and set the job state as UNKNOWN, thus queuing the job normally.

What do you mean by "finished gracefully"? Above you said the return value is different from zero? So it's not finished gracefully then? Or is it?

What do you mean by "there's usually a job already queued "naturally", that is, queued because the queue scan job finished gracefully"? Why is the job already queued? If the command exists with non-zero code, usually the job is not queued, or not?

Sorry, I don't fully understand the actual problem. Aren't you saying there is a bug in queue_state? But then, why not fix this bug? It seems like instead of fixing the bug, you add a workaround at another place? But maybe I just didn't fully understand it.

In my original comment I was referring to two situations that interact badly with each other; let me detail them further (assume SLURM is the engine):

  1. The predecessor of the job/task pair finishes.
  2. The queue is scanned for the job/task pair. The squeue command is executed, it returns 0 (successful, "the queue scan job finished gracefully"), and there is no job/task pair waiting in the queue. Moreover, there is no log.<task>.X file ("log file attached") in the filesystem. As a consequence, the job state becomes RUNNABLE, and the job is shortly sent into the queue via sbatch. Note that, if there had been any log.<task>.X file, the job state would have become UNKNOWN instead, because of how Job._is_runnable() works.
  3. A minute or so later, another squeue command is executed while the previous job is still waiting in the queue. This last part is key for the issue we're seeing, because the log.<task>.X file won't have been generated yet. Keep this in mind when reading below.
  4. This time the squeue command fails with some error code (say -1; the exact value is not relevant). With the code as it was before this PR, a non-zero return value of the squeue command was not handled at all: the manager takes the stdout (empty) and stderr (probably containing some error text), but it is only interested in the stdout. After parsing the empty stdout, it concludes that there is no job/task pair waiting in the queue (which is wrong). Moreover, there is no log.<task>.X file ("log file attached") in the filesystem, as per (3). As a consequence, the job/task pair is detected as runnable and is sent into the queue again shortly after via sbatch.

Note that the wrong behavior ("same job queued twice") ultimately comes from the queue scan command not being retried whenever its return code is different from zero, as stated above. Therefore, I believe my proposed code isn't a workaround at all, but an actual solution to our issue.
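To make the failure path explicit, here is a simplified sketch of the pre-PR behavior described above (the helper names are illustrative, not the actual sisyphus code):

# Sketch only; run_squeue and parse_squeue_output are hypothetical helpers.
out, err, retval = run_squeue(system_command)   # here retval != 0

# The return value was ignored, so a failed call was parsed like a
# successful one with empty output:
task_infos = parse_squeue_output(out)           # -> empty mapping

# task_state() finds no entry for the job/task pair     -> STATE_UNKNOWN
# Task._get_state_helper() finds no log.<task>.X file   -> STATE_RUNNABLE
# The manager then submits the job a second time via sbatch.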

I hope I have clarified your doubts. Please let me know if there are still any questions or comments.

Icemole added 3 commits July 18, 2024 04:11
The queue submit command is already handling return values below
@albertz (Member) commented Jul 18, 2024

This time the squeue command fails with whatever error code, let's assume -1, not relevant. With the code as it previously was, there was no distinction for any return value different from zero when submitting a squeue command

What do you mean "the code as it previously was"? The error code?

What do you mean "no distinction for any return value different from zero"? I fail to parse the grammar of this sentence. The squeue command has a return value (error code) different from 0, as you said, or not?

I'm looking at queue_state now. One problem seems to be that we parse the stdout no matter of the return value (retval).

But ok. It returns an empty defaultdict(list) then. The only place queue_state is used is in the task_state function. It should run into this code:

queue_state = self.queue_state()
qs = queue_state[task_name]
if qs == []:  # <--- this should be True
    return STATE_UNKNOWN

So it will return STATE_UNKNOWN. But that is different from STATE_RUNNABLE?

@albertz (Member) commented Jul 18, 2024

Ah, in Task._get_state_helper():

if engine_state == gs.STATE_UNKNOWN:
    if self.started(task_id):
        ...
    else:
        return gs.STATE_RUNNABLE

Task.started() will return False when the log file does not exist.
And that is what you get in Task.state() then. So that is why it thinks this task is runnable.

But isn't that the main problem here? This logic I just showed from Task._get_state_helper, that STATE_UNKNOWN and log file does not exist -> STATE_RUNNABLE? Shouldn't we change/fix that? In what cases does this logic make sense?

@michelwi (Contributor) commented:

But isn't that the main problem here? This logic I just showed from Task._get_state_helper, that STATE_UNKNOWN and log file does not exist -> STATE_RUNNABLE? Shouldn't we change/fix that? In what cases does this logic make sense?

This logic is used in the case when a job is (actually) runnable: STATE_UNKNOWN in the engine means "This job is not known to the engine". There could be three reasons for it:

  1. it has not been submitted yet. This is what sisyphus assumes is the case here and (re-)submits the job
  2. the squeue call failed. This is not handled in the master branch and likely causes the problem
  3. the job is finished. In this case the worker has set the finished marker and _get_state_helper will return STATE_FINISHED here

To fix the problem we should fix reason 2. I tend to agree with Albert that

One problem seems to be that we parse the stdout no matter of the return value (retval).

In SGE the _task_info_cache is not updated when there is a parsing error; maybe we should do something similar here as well?

As an aside: I noticed that, when a job is submitted, it is added to the _task_info_cache as PENDING, and even when a reset of the cache is requested, the entry is not actually dropped; only the next call to queue_state will trigger a call to squeue.

So if we avoid setting the _task_info_cache to an empty dict, we should be good.

@albertz (Member) commented Jul 18, 2024

This logic is used in the case when a job is (actually) runnable: STATE_UNKNOWN in the engine means "This job is not known to the engine".

If that is really what STATE_UNKNOWN means, then it means, task_state() is wrong, and we should fix that.

@michelwi (Contributor) commented:

If that is really what STATE_UNKNOWN means,

That would be my interpretation based on the code.

then it means, task_state() is wrong, and we should fix that.

how would you propose to fix that?

@albertz (Member) commented Jul 18, 2024

then it means, task_state() is wrong, and we should fix that.

how would you propose to fix that?

Maybe introduce a new state STATE_QUEUE_ERROR or so, which tells the engine that the job is in an unknown state.

Oh, I just saw, there is already such a state. I think queue_state should throw an exception if retval != 0, and we catch that exception in task_state, and return STATE_QUEUE_ERROR if we get this.

I also wonder: when I look at the other engines' task_state, they all return STATE_QUEUE_ERROR on error. Only the SLURM engine's task_state does not.
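A minimal sketch of the "raise" side of this proposal, assuming out, err and retval are already available inside queue_state (illustrative, not the merged code):

import subprocess

# Inside queue_state(), after running the queue command (sketch only):
if retval != 0:
    # Propagate the failure instead of silently parsing the empty stdout;
    # task_state() can then catch this and return STATE_QUEUE_ERROR.
    raise subprocess.CalledProcessError(retval, system_command, output=out, stderr=err)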

@albertz (Member) commented Jul 18, 2024

I've added STATE_QUEUE_ERROR just like in the rest of the engines.

But you still have this wrong code there?

if qs == []:
    return STATE_UNKNOWN

Sorry, see my other comment. This should be correct. But what you actually need to add is something like:

try:
    queue_state = self.queue_state()
except subprocess.CalledProcessError:
    return STATE_QUEUE_ERROR

@curufinwe (Collaborator) commented:

IMO we should not throw an exception if the parsing of the queue output fails, but treat it in the same way as a timeout error and assume that the error is transient and will go away (either by itself or through admin intervention). Thus we should just keep on trying until we get a correct output from squeue.

@Icemole (Collaborator, Author) commented Aug 19, 2024

we should not throw an exception if the parsing of the queue output fails, but treat it in the same way as a timeout error and assume that the error is transient and will go away (either by itself or through admin intervention).

That's more or less what we were doing before, and what we're doing now with the new functionality as described by @albertz.

@Icemole Icemole requested a review from michelwi August 19, 2024 08:51
@michelwi (Contributor) left a comment:

we should not throw an exception if the parsing of the queue output fails, but treat it in the same way as a timeout error and assume that the error is transient and will go away (either by itself or through admin intervention).

That's more or less what we were doing before, and what we're doing now with the new functionality as described by @albertz.

Yes, but it is not retried very often. I tested it now and found:

[2024-08-19 10:09:39,353] INFO: Finished updating job states
[2024-08-19 10:09:39,367] INFO: Experiment directory: /path/to/my/setup_folder     Call: /path/to/sisyphus/sis m config/my_config.py
[2024-08-19 10:09:39,387] INFO: queue_error(11) waiting(72)
Print verbose overview (v), update aliases and outputs (u), start manager (y), or exit (n)? y
[2024-08-19 10:10:38,269] INFO: There is nothing I can do, good bye!

After retrying once (after 1 minute), the manager terminates.

@Icemole (Collaborator, Author) commented Aug 19, 2024

after retrying once (after 1 min) the manager terminates

I would definitely always retry. This seems to be a rather small issue. Should the STATE_QUEUE_ERROR be redefined for this purpose?

@michelwi (Contributor) commented:

after retrying once (after 1 min) the manager terminates

I would definitely always retry. This seems to be a rather small issue. Should the STATE_QUEUE_ERROR be redefined for this purpose?

I think this would need to be added in the manager's work_left function, but I'm not sure about other side effects.

@Icemole (Collaborator, Author) commented Aug 19, 2024

We're forgetting that QUEUE_ERROR represents a job in the Eqw state (the SGE equivalent), so I believe erroring out on QUEUE_ERROR would be totally correct and we mustn't modify that. Maybe we should do as Eugen suggested and make it a RUNNABLE task instead of having it fail with QUEUE_ERROR.

Or alternatively, create a new SUBMISSION_ERROR state that directly transitions to RUNNABLE or acts similarly to RUNNABLE. But for that I would simply use RUNNABLE.

@Icemole Icemole requested review from michelwi and albertz August 20, 2024 05:53
@Icemole (Collaborator, Author) commented Aug 20, 2024

So we can't use STATE_RUNNABLE because there's an assertion when getting the state of a job:

File ~/work/sisyphus/too-many-open-files-fix/recipe/sisyphus/sisyphus/task.py:371, in Task._get_state_helper(self=<Task 'run' job=Job<work/i6_core/recognition/conversion/LatticeToCtmJob.Cy7LtqTAYoNa>>, engine=<sisyphus.engine.EngineSelector object>, task_id=1)
    369 else:
    370     engine_state = engine.task_state(self, task_id)
--> 371     assert engine_state in (gs.STATE_QUEUE, gs.STATE_QUEUE_ERROR, gs.STATE_RUNNING, gs.STATE_UNKNOWN)

I think implementing Eugen's plan would be best: handle a failed queue command right after it fails, just like a timeout error. There are some alternatives:

  1. Returning to a previous version from this same PR in which that was done: see here.
  2. In a solution that might make everyone more comfortable, the code could be something like:
def task_state(self, task, task_id):
    # ...
    while True:
        try:
            queue_state = self.queue_state()
            break
        except subprocess.CalledProcessError:
            pass

Note: we could make the code retry only a limited number of times, say 10, and actually enter STATE_QUEUE_ERROR if the queue can't be scanned successfully within that many tries.
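A hedged sketch of that bounded-retry variant (the retry limit and sleep period are illustrative; assumes import time and subprocess, and gs as in the snippets above):

def task_state(self, task, task_id):
    # ...
    for _ in range(10):  # illustrative retry limit
        try:
            queue_state = self.queue_state()
            break
        except subprocess.CalledProcessError:
            time.sleep(gs.WAIT_PERIOD_QSTAT_PARSING)
    else:
        # the queue could not be scanned successfully within the retry budget
        return STATE_QUEUE_ERROR
    # ...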

As a reminder, the current code is:

def task_state(self, task, task_id):
    # ...
    try:
        queue_state = self.queue_state()
    except subprocess.CalledProcessError:
        return STATE_QUEUE_ERROR  # or STATE_RUNNABLE

which is wrong. STATE_QUEUE_ERROR is meant to be used for a job in Eqw state. And I don't feel very comfortable modifying code inside the Engine class so that the queue state accepts STATE_RUNNABLE as well just for this.

@michelwi (Contributor) commented:

STATE_QUEUE_ERROR is meant to be used for a job in Eqw state.

ok.

and make it a RUNNABLE task instead

that is wrong as well. A runnable task will be (re-)submitted by the manager. This is exactly the behavior before the PR that we wanted to fix.

Thus we should just keep on trying until we get a correct output from squeue.

I agree here. My proposal for how to do this would be similar to how the SGE engine does it:

# parse qstat output
try:
    etree = xml.etree.cElementTree.fromstring(xml_data)
except xml.etree.cElementTree.ParseError:
    logging.warning(
        "qstat -xml parsing error, retrying\n"
        "command: %s\n"
        "stdout: %s\n"
        "stderr: %s\n"
        "return value: %s" % (system_command, out, err, retval)
    )
    time.sleep(gs.WAIT_PERIOD_QSTAT_PARSING)
    return self.queue_state()

If there is an error, just retry. This might (probably) also block the manager from submitting while the problem persists.

Now we only check whether the return value is != 0, and if so, we wait a bit and rerun the command. Note that this might block the manager
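For clarity, a hedged sketch of the behavior described in the commit message above, in the spirit of the earlier diff (the system-call helper name is an assumption, not necessarily the exact merged code):

# Inside queue_state() (sketch only):
while True:
    out, err, retval = run_queue_command(system_command)  # hypothetical helper
    if retval != 0:
        # Queue scan failed: warn, wait a bit, and rerun the command.
        # This blocks queue_state() (and thus the manager) until the
        # queue answers successfully again.
        logging.warning(self._system_call_error_warn_msg(system_command))
        time.sleep(gs.WAIT_PERIOD_QSTAT_PARSING)
        continue
    break
# ... parse `out` and update self._task_info_cache as before ...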
@Icemole (Collaborator, Author) commented Aug 23, 2024

I think this is done. Please check if you like this approach.

@michelwi (Contributor) left a comment:

I think this is done. Please check if you like this approach.

yes, thank you.

(Outdated, resolved review comment on sisyphus/aws_batch_engine.py)
@michelwi (Contributor) left a comment:

Tested this version on slurm and it works as expected.

@michelwi michelwi merged commit da7f29c into master Aug 26, 2024
3 checks passed
@albertz albertz deleted the handle-bad-return-value branch August 26, 2024 12:05
Successfully merging this pull request may close these issues: "More then one matching SLURM task (again)".