More then one matching SLURM task #163
Comments
This seems really strange to me. You see in the log:
The "Submitted with job_id" message is correct; the job was indeed submitted.
Maybe this is some multi-threading bug? I see that |
Very strange, especially since both submissions are nearly an hour apart. This basically eliminates the possibility of caching effects. It looks more like sisyphus did not find the job when checking whether it had been submitted, but for some reason found it when checking the current status. Which is strange...
I got this again:
So it looks a bit like it incorrectly identified it as |
I wondered how it can be in this state interrupted_resumable. In
    engine_state = engine.task_state(self, task_id)
    ...
    if engine_state == gs.STATE_UNKNOWN:
        if self.started(task_id):
            # check again if it finished or crashed while retrieving the state
            if self.finished(task_id):
                return gs.STATE_FINISHED
            elif self.error(task_id):
                return gs.STATE_ERROR
            # job logging file got updated recently, assume job is still running.
            # used to avoid wrongly marking jobs as interrupted do to slow filesystem updates
            elif self.running(task_id):
                return gs.STATE_RUNNING
            history = [] if engine is None else engine.get_submit_history(self)
            if history and len(history[task_id]) > gs.MAX_SUBMIT_RETRIES:
                # More then three tries to run this task, something is wrong
                return gs.STATE_RETRY_ERROR
            else:
                # Task was started, but isn't running anymore => interrupted
                if self._resume is None:
                    return gs.STATE_INTERRUPTED_NOT_RESUMABLE
                else:
                    return gs.STATE_INTERRUPTED_RESUMABLE  # <-- going here
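
To make the path to `STATE_INTERRUPTED_RESUMABLE` easier to follow, here is a minimal sketch of the same decision chain as a pure function. It is not the sisyphus code: `resolve_unknown_state`, its boolean parameters, the string constants and the default of 3 for `max_submit_retries` are all assumptions made for illustration, and the quoted snippet does not show what happens when the task was never started, so the sketch just returns the unknown state in that case.

```python
# Hypothetical stand-ins for the gs.STATE_* constants.
STATE_UNKNOWN = "unknown"
STATE_FINISHED = "finished"
STATE_ERROR = "error"
STATE_RUNNING = "running"
STATE_RETRY_ERROR = "retry_error"
STATE_INTERRUPTED_NOT_RESUMABLE = "interrupted_not_resumable"
STATE_INTERRUPTED_RESUMABLE = "interrupted_resumable"


def resolve_unknown_state(started, finished, error, running,
                          submit_count, resumable, max_submit_retries=3):
    """Decide a task state after the engine reported an unknown state.

    Mirrors the order of checks in the quoted snippet; all inputs are
    plain values instead of file system and engine lookups.
    """
    if not started:
        # Not covered by the quoted snippet; assumption for this sketch.
        return STATE_UNKNOWN
    if finished:
        return STATE_FINISHED
    if error:
        return STATE_ERROR
    if running:
        return STATE_RUNNING
    if submit_count > max_submit_retries:
        return STATE_RETRY_ERROR
    # Started, but neither finished, failed nor running => interrupted.
    if resumable:
        return STATE_INTERRUPTED_RESUMABLE
    return STATE_INTERRUPTED_NOT_RESUMABLE


# A previously started, resumable task whose engine state is unknown:
state = resolve_unknown_state(started=True, finished=False, error=False,
                              running=False, submit_count=1, resumable=True)
print(state)
```

The point of the sketch: once the engine reports an unknown state for a task that was started earlier and is not currently writing its log, the only remaining outcomes are a retry error or one of the interrupted states.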
So, the main problem here is that |
Ah, I think I found the problem: In
    logging.info("Submitted with job_id: %s %s" % (job_id, name))
    for task_id in range(start_id, end_id, step_size):
        self._task_info_cache[(name, task_id)].append((job_id, "PD"))
But then in
    state = qs[0][1]
    if state in ["RUNNING", "COMPLETING"]:
        return STATE_RUNNING
    elif state in ["PENDING", "CONFIGURING"]:
        return STATE_QUEUE
    else:
        return STATE_UNKNOWN
I.e. |
submit_helper writes an invalid state to _task_info_cache, and then task_state returns STATE_UNKNOWN, which causes STATE_INTERRUPTED_RESUMABLE, which causes a resubmit. Fix #163
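
The mismatch can be reproduced in isolation. The following is a rough sketch, not the actual engine code: `map_slurm_state` copies the state checks quoted above, while `record_submission`, the task name `"train"` and the job id `"12345"` are hypothetical stand-ins. Writing `"PENDING"` in the second half is one possible way to make the cached value recognizable, assumed here for illustration.

```python
from collections import defaultdict

# Hypothetical stand-ins for the engine's state constants.
STATE_RUNNING = "running"
STATE_QUEUE = "queue"
STATE_UNKNOWN = "unknown"


def map_slurm_state(state):
    """Map a SLURM state string using the same checks as the quoted snippet."""
    if state in ["RUNNING", "COMPLETING"]:
        return STATE_RUNNING
    elif state in ["PENDING", "CONFIGURING"]:
        return STATE_QUEUE
    else:
        return STATE_UNKNOWN


def record_submission(cache, name, task_ids, job_id, state):
    """Hypothetical helper: remember a fresh submission in the cache."""
    for task_id in task_ids:
        cache[(name, task_id)].append((job_id, state))


# Short state code, as written by the quoted submit code.
cache = defaultdict(list)
record_submission(cache, "train", range(1, 3), "12345", "PD")
short_form = map_slurm_state(cache[("train", 1)][0][1])

# Long state name, which the mapping recognizes.
cache_fixed = defaultdict(list)
record_submission(cache_fixed, "train", range(1, 3), "12345", "PENDING")
long_form = map_slurm_state(cache_fixed[("train", 1)][0][1])

print(short_form, long_form)
```

With the short code the freshly submitted task is reported as unknown, and with the long name it is reported as queued.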
This is the same message as in #156, but I think the bug here is different: the log shows no "Error to submit job", and the problem in #156 was already fixed by #157, which I have seen working as intended multiple times.
Maybe it is relevant that when I interrupted the manager here, I got the crash from #164.