Queue scan: handle return value different from zero #198
Conversation
Whenever the return value of the queue scan command is different from zero, the job state is eventually set to `UNKNOWN`. If the job is set to `UNKNOWN` and there's no log file attached, the job will be automatically queued. This is very dangerous when there's a job already queued "naturally", that is, queued because the queue scan finished gracefully, found nothing, and set the job state to `UNKNOWN`. This wrong process can repeat many times, until the first scheduled job eventually enters the `RUNNING` state and a log file is generated, preventing any further scheduling of the same job.
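For orientation, here is a minimal, self-contained sketch of the state decision described above. This is not sisyphus's actual code; it only mirrors the `engine_state == STATE_UNKNOWN` logic quoted later in this thread, with simplified names and state constants:

```python
# Hypothetical, simplified sketch of the state decision that causes the
# double scheduling: if the engine reports UNKNOWN and no log file exists
# for the task, the task is considered RUNNABLE and gets submitted again.
STATE_UNKNOWN = "unknown"
STATE_RUNNABLE = "runnable"
STATE_RUNNING = "running"


def task_state(engine_state: str, log_file_exists: bool) -> str:
    if engine_state == STATE_UNKNOWN:
        if log_file_exists:
            # A log file means the task already started at some point.
            return STATE_RUNNING  # simplified; the real logic inspects the log
        # No log file: the task looks like it never ran, so it is queued again,
        # even if a failed queue scan (non-zero return value) caused UNKNOWN.
        return STATE_RUNNABLE
    return engine_state
```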
Copy-paste issue :/
```python
if retval != 0:
    logging.warning(self._system_call_error_warn_msg(system_command))
    time.sleep(gs.WAIT_PERIOD_BETWEEN_CHECKS)
    continue
```
Not sure if this solves the problem. If there is some fundamental problem with the queue that is not recoverable, then this sends the function into an endless loop. Maybe instead just return the old cache?
Suggested change:

```python
if retval != 0:
    logging.warning(self._system_call_error_warn_msg(system_command))
    time.sleep(gs.WAIT_PERIOD_BETWEEN_CHECKS)
    continue
```

```python
if retval != 0:
    return self._task_info_cache
```
In the worst case we will never update the cache and a running/finished Job will still be marked as pending.
IMHO I would log and wait, so that the normal cache update process can run
But we should definitely not use `gs.WAIT_PERIOD_BETWEEN_CHECKS`, as that is used for:

sisyphus/sisyphus/global_settings.py, lines 184 to 185 in ce5f7a2:

```python
#: How often should the manager check for finished jobs
WAIT_PERIOD_BETWEEN_CHECKS = 30
```
define a new variable?
I found `gs.WAIT_PERIOD_QSTAT_PARSING`. I assume `QSTAT` implies that sisyphus was originally intended to run only with SGE, and the functionality was extended afterwards. I would only use this variable and not create any other, "more generic" one such as `gs.WAIT_PERIOD_QUEUE_PARSING`.
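A minimal runnable sketch of that suggestion, reusing the existing `gs.WAIT_PERIOD_QSTAT_PARSING` for the wait. The helper function is illustrative only; the two constant names come from sisyphus's `global_settings`, everything else is an assumption for this sketch:

```python
import logging
import time

from sisyphus import global_settings as gs


def wait_after_failed_queue_scan(retval: int, command: str) -> None:
    """Log a failed queue scan and wait before the next attempt (sketch only)."""
    if retval != 0:
        logging.warning("queue command failed: %s (return value %d)", command, retval)
        # Reuse the existing queue-related wait period instead of the
        # manager-level gs.WAIT_PERIOD_BETWEEN_CHECKS.
        time.sleep(gs.WAIT_PERIOD_QSTAT_PARSING)
```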
Is this related to #175? Or is there any issue which describes the problem?
```python
if retval != 0:
    logging.warning(self._system_call_error_warn_msg(system_command))
    time.sleep(gs.WAIT_PERIOD_BETWEEN_CHECKS)
    continue
```
> IMHO I would log and wait, so that the normal cache update process can run
Yes, possibly.
Yes, definitely, this is what I've been observing for quite some time now, and I believe handling this here fixes the issue :)
What do you mean by "queue scan command"? You mean what happens in `queue_state`? What do you mean by "log file attached"?

What do you mean by "finished gracefully"? Above you said the return value is different from zero? So it's not finished gracefully then? Or is it? What do you mean by "there's usually a job already queued "naturally", that is, queued because the queue scan job finished gracefully"? Why is the job already queued? If the command exits with a non-zero code, usually the job is not queued, or not? What do you mean by "queue scan job"? Before you said "queue scan command". Do you mean the same? Do you actually mean the …?

What do you mean by "scan"? Sorry, I don't fully understand the actual problem. Aren't you saying there is a bug in …?
@albertz I'll answer you in line.
Yes, I mean what happens in `queue_state`.

I mean the log file attached to the job/task pair. For instance, if we're running some job whose task is …

Yes, I mean the same, apologies for the confusing nomenclature. I mean running `squeue`.

Running …

In my original comment I was referring to two situations that interact wrongly with each other; let me detail them further (assume SLURM is the manager): …

Note that the unintended behavior of "same job queued twice" comes from … I hope I have clarified your doubts. Please let me know if there are still any questions or comments.
The queue submit command is already handling return values below.
What do you mean by "the code as it previously was"? The error code? What do you mean by "no distinction for any return value different from zero"? I fail to parse the grammar of this sentence.

I'm looking at … But ok. It returns an empty …, so:

```python
queue_state = self.queue_state()
qs = queue_state[task_name]
if qs == []:  # <--- this should be True
    return STATE_UNKNOWN
```

So it will return `STATE_UNKNOWN`.
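To illustrate why `qs == []` holds here, a tiny hedged example, assuming `queue_state()` yields an empty `defaultdict(list)` when the scan finds nothing (an assumption about the return type; the thread only says "it returns an empty …"):

```python
from collections import defaultdict

# Hypothetical illustration: an "empty" queue state with no parsed entries.
queue_state = defaultdict(list)

# Looking up any task name in an empty defaultdict(list) yields [],
# so the `qs == []` branch above is taken and STATE_UNKNOWN is returned.
qs = queue_state["some_task_name"]
assert qs == []
```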
Ah, in …:

```python
if engine_state == gs.STATE_UNKNOWN:
    if self.started(task_id):
        ...
    else:
        return gs.STATE_RUNNABLE
```

But isn't that the main problem here? This logic I just showed from …
This logic is used in the case when a job is (actually) runnable: …

To fix the problem we should fix 2. I tend to agree with Albert that …
In SGE the … As an aside: I noticed that, when a job is submitted, it is added to the … So if we avoid setting the …
If that is really what …
That would be my interpretation based on the code.
How would you propose to fix that?
Maybe introduce a new state … Oh, I just saw, there is already such a state. I think … I also wonder, when I look at other engines, …
```python
if qs == []:
    return STATE_UNKNOWN
```

Sorry, see my other comment. This should be correct. But what you actually need to add is sth like:

```python
try:
    queue_state = self.queue_state()
except subprocess.CalledProcessError:
    return STATE_QUEUE_ERROR
```
IMO we should not throw an exception if the parsing of the queue output fails, but treat it in the same way as a timeout error and assume that the error is transient and will go away (either by itself or through admin intervention). Thus we should just keep on trying until we get a correct output from `squeue`.
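A minimal sketch of that retry-until-success idea for the SLURM case, where the scan command is `squeue`. The command-line options, the helper name, and the wait constant are illustrative assumptions, not sisyphus code:

```python
import logging
import subprocess
import time

WAIT_ON_QUEUE_ERROR = 30  # seconds; hypothetical constant for this sketch


def scan_queue_until_success(command=("squeue", "-h", "-o", "%i %T %j")):
    """Retry the queue scan until it returns output we can use (sketch only)."""
    while True:
        try:
            output = subprocess.run(
                command, check=True, capture_output=True, text=True
            ).stdout
            # Split each line into (job id, state, name).
            return [line.split(maxsplit=2) for line in output.splitlines() if line]
        except subprocess.CalledProcessError as exc:
            # Non-zero exit code (or, in a real implementation, also a parse
            # failure): assume the error is transient and just retry.
            logging.warning("queue scan failed (%s), retrying", exc)
            time.sleep(WAIT_ON_QUEUE_ERROR)
```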
That's more or less what we were doing before, and what we're doing now with the new functionality as described by @albertz.
> we should not throw an exception if the parsing of the queue output fails, but treat it in the same way as a timeout error and assume that the error is transient and will go away (either by itself or through admin intervention).
>
> That's more or less what we were doing before, and what we're doing now with the new functionality as described by @albertz.
Yes, but it is not retried very often. I tested it now and found:

```
[2024-08-19 10:09:39,353] INFO: Finished updating job states
[2024-08-19 10:09:39,367] INFO: Experiment directory: /path/to/my/setup_folder Call: /path/to/sisyphus/sis m config/my_config.py
[2024-08-19 10:09:39,387] INFO: queue_error(11) waiting(72)
Print verbose overview (v), update aliases and outputs (u), start manager (y), or exit (n)? y
[2024-08-19 10:10:38,269] INFO: There is nothing I can do, good bye!
```

After retrying once (after 1 min) the manager terminates.
I would definitely always retry. This seems to be a rather small issue. Should the …?
I think this would need to be added in the manager `work_left` function. But I'm not sure about other side effects.
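A hedged sketch of what such a check could look like, assuming the manager can see per-state job counts. The function, the state names, and the dictionary argument are hypothetical for this sketch; this is not the actual `work_left` implementation:

```python
# Hypothetical sketch: jobs stuck in "queue_error" should still count as
# pending work, so the manager keeps running (and keeps retrying) instead
# of terminating with "There is nothing I can do, good bye!".
ACTIVE_STATES = {"runnable", "queue", "running", "queue_error"}


def work_left(state_counts: dict) -> bool:
    """state_counts maps a state name to the number of jobs in that state."""
    return any(count > 0 for state, count in state_counts.items()
               if state in ACTIVE_STATES)


# With the counts from the log above, the manager would keep going:
assert work_left({"queue_error": 11, "waiting": 72}) is True
```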
We're forgetting that … Or alternatively, create a new …
So we can't use …
I think implementing Eugen's plan would be best: treat the failing queue command, right after it fails, just like a timeout error. There are some alternatives: …

Note: we can add some functionality to make the code do that, say, 10 times, and only then actually enter … As a reminder, the current code is … which is wrong.
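A minimal sketch of the bounded variant described in the note above: retry the failing scan a fixed number of times (e.g. 10) before giving up and reporting a queue-error state. The constants and the helper are illustrative assumptions, not the sisyphus API:

```python
import logging
import time

STATE_QUEUE_ERROR = "queue_error"  # mirrors the state name seen in the logs above
MAX_QUEUE_RETRIES = 10             # illustrative retry limit
WAIT_BETWEEN_RETRIES = 30          # seconds; illustrative


def queue_state_or_error(run_queue_command):
    """Retry a failing queue scan a bounded number of times (sketch only)."""
    for attempt in range(MAX_QUEUE_RETRIES):
        retval, output = run_queue_command()
        if retval == 0:
            return output
        logging.warning("queue scan failed (attempt %d/%d, return value %d)",
                        attempt + 1, MAX_QUEUE_RETRIES, retval)
        time.sleep(WAIT_BETWEEN_RETRIES)
    # Only after repeated failures do we surface the error to the caller,
    # instead of marking jobs as UNKNOWN (which would trigger resubmission).
    return STATE_QUEUE_ERROR
```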
OK.

That is wrong as well. A runnable task will be (re-)submitted by the manager. This is exactly the behavior before the PR that we wanted to fix.

I agree here. My proposal for how to do this would be similar to how SGE does it: sisyphus/sisyphus/son_of_grid_engine.py, lines 329 to 342 in 4ca86d2.

If there is an error, just retry. This might (probably) also block the manager from submitting while the problem persists.
Now we only check whether the return value is != 0, and if so, we wait a bit and rerun the command. Note that this might block the manager.

I think this is done. Please check if you like this approach.
> I think this is done. Please check if you like this approach.

Yes, thank you.
Tested this version on SLURM and it works as expected.
Fix #175.
Whenever the return value of the queue scan command is different from zero, the job states are eventually set as `UNKNOWN`. Whenever some job is set as `UNKNOWN` and there's no log file attached, the job will be automatically considered as `RUNNABLE` and queued.

This is a very dangerous process because there's usually a job already queued "naturally", that is, queued because the queue scan job finished gracefully, found nothing, and set the job state as `UNKNOWN`, thus queuing the job normally.

This wrong process can happen many times, until the first job scheduled eventually enters the `RUNNING` state, and thus a log file is generated, preventing any further same-job schedulings.

This PR fixes such wrong behavior by making the queue scan wait some seconds before rescheduling it again after a failure in the scan, and not directly setting the job state as `UNKNOWN`.