again on the topic of tasks submitted multiple times, following up on #6145
We still do not have a totally fool-proof protocol. In particular the schedd may become
unresponsive right after a dag_bootstrap job has been released, so the TaskWorker
sees an error but is unable to remove the job and goes on in its retry loop, eventually
submitting the same task to another schedd.
Rare, but possible, and by Murphy's law, we should protect against it.
One way is a more complex task state machine where the task is put in a different state until
proper submission can be confirmed, as suggested by Brian:
On 03/09/2020 20:26, Brian Paul Bockelman wrote:
> From the CRAB side, we probably ought to have a separate state here -
> let's call it "SUBMITTED" and "ACTIVE".
>
> 1. Job is submitted to schedd, gets a cluster ID back. Cluster ID is stored
> in the task's description and task moves to the state SUBMITTED.
> 2. Job spooling is attempted. If successful, the job is moved into the "ACTIVE" state
> and the input files are deleted from the TaskWorker.
> 3. If spooling is unsuccessful, job stays in the "SUBMITTED" state.
> On the next iteration, the TaskWorker
> queries the status of all tasks in SUBMITTED state
> (you can do that as you recorded the Cluster ID).
> If the job is released, then mark the task as ACTIVE and delete the input files from the TW.
> If the job is still held due to spooling, go to step (2).
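
A minimal sketch of how the TaskWorker side of this protocol could look, just to make the flow concrete. All helper names here (`db.get_tasks`, `schedd.query_job`, `retry_spool`, `delete_input_files`) are assumptions for illustration, not the actual CRABServer or HTCondor python-bindings API:

```python
# Hypothetical sketch of the SUBMITTED/ACTIVE protocol on the TaskWorker side.
# None of these objects are the real CRABServer classes; they only illustrate the flow.

def handle_submitted_tasks(db, schedd, delete_input_files):
    """Re-check, on every TaskWorker iteration, the tasks left in SUBMITTED state."""
    for task in db.get_tasks(status='SUBMITTED'):
        # Step 1 stored the cluster ID in the task description, so we can query
        # the schedd for that specific job instead of submitting a new one.
        job = schedd.query_job(task.cluster_id)
        if job is None:
            # The job disappeared from the schedd: safe to start over from scratch.
            db.update_task(task.name, status='NEW')
        elif job.is_released():
            # Spooling eventually succeeded: promote to ACTIVE and clean up (step 2).
            db.update_task(task.name, status='ACTIVE')
            delete_input_files(task)
        else:
            # Still held for spooling: retry step (2) on the same cluster ID,
            # never submitting a second cluster for the same task.
            schedd.retry_spool(task.cluster_id)
```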
I think it is simpler to do the same from the other end: since the dag_bootstrap already has a step
where it communicates to the DB, to upload the web directory and fail if it can't, this could be
extended to check that the task is indeed in SUBMITTED status (i.e. the TaskWorker has finished
working on it) and that the schedd/cluster-id recorded in the DB match the current job:
On 04/09/2020 00:35, Stefano Belforte wrote:
> Of course condor_rm may always fail, esp. if the schedd is
> busy (or spooling would have worked to begin with).. but let's see.
> Maybe the dag_bootstrap can check that it is an acknowledged child,
> e.g. when it uploads the WEBDIR location (a useless action
> per se, as we know very well where the webdir is).
>
> The more I think about this, the more I like that if I am the dagbootstrap,
> I want to be sure that this task is registered in the DB with status SUBMITTED
> and with the same clusterId and schedd name
> which I have, before I unleash my DAGMAN.
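
A minimal sketch of what such a check in dag_bootstrap could look like, run right before condor_dagman is started. The REST call, field names and environment variables are assumptions for illustration, not the real CRAB REST schema or job environment:

```python
import os
import sys

def confirm_ownership(rest_client, task_name):
    """Refuse to start DAGMAN unless the DB says this very job owns the task."""
    # Fetch the task row, e.g. with the same client already used to upload the WEBDIR.
    task = rest_client.get_task(task_name)
    my_schedd = os.environ.get('SCHEDD_NAME')         # assumed available in the job environment
    my_cluster = os.environ.get('CONDOR_CLUSTER_ID')  # assumed available in the job environment
    if task['status'] != 'SUBMITTED':
        sys.exit("task %s is not in SUBMITTED status, refusing to start DAGMAN" % task_name)
    if (task['schedd'], task['clusterid']) != (my_schedd, my_cluster):
        # The TaskWorker has (re)submitted this task elsewhere: this copy is a duplicate.
        sys.exit("task %s is registered on %s/%s, not on this schedd/cluster, "
                 "refusing to start DAGMAN" % (task_name, task['schedd'], task['clusterid']))
```

If the check fails, the duplicate dag_bootstrap exits before unleashing its DAGMAN, and the leftover job can be removed later without risk of two DAGs running for the same task.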