dag_bootstrap should abort if not properly registered in Task DB #6151

Closed
belforte opened this issue Sep 8, 2020 · 0 comments
belforte commented Sep 8, 2020

Again on the topic of tasks submitted multiple times, following up on #6145:
we still do not have a totally fool-proof protocol. In particular the schedd may become
unresponsive right after a dag_bootstrap job has been released, so the TaskWorker
sees an error but is unable to remove the job, and goes on in its retry loop, eventually
submitting the same task to another schedd.

Rare, but possible, and by Murphy's law, we should protect against it.

One way is a more complex task state machine where the task is put in a different state until
proper submission can be confirmed, as suggested by Brian:

On 03/09/2020 20:26, Brian Paul Bockelman wrote:

> From the CRAB side, we probably ought to have a separate state here -
> let's call it "SUBMITTED" and "ACTIVE".
>
> 1. Job is submitted to schedd, gets a cluster ID back. Cluster ID is stored
>    in the task's description and task moves to the state SUBMITTED.
> 2. Job spooling is attempted. If successful, job is moved into "ACTIVE" state
>    and the input files are deleted from the TaskWorker.
> 3. If spooling is unsuccessful, job stays in the "SUBMITTED" state. On the
>    next iteration, the TaskWorker queries the status of all tasks in
>    SUBMITTED state (you can do that as you recorded the Cluster ID). If the
>    job is released, then mark the task as ACTIVE and delete the input files
>    from the TW. If the job is still held due to spooling, go to step (2).
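
For concreteness, here is a minimal Python sketch of the SUBMITTED/ACTIVE flow Brian describes. All names here (`schedd`, `task_db`, `locate_schedd`, `delete_input_files`) are hypothetical stand-ins for the real HTCondor bindings, the Task DB client, and the TaskWorker cleanup code, not actual CRAB APIs:

```python
# Sketch only: schedd, task_db, locate_schedd and delete_input_files
# are hypothetical placeholders for the real TaskWorker machinery.

def submit_task(task, schedd, task_db):
    # (1) Submit to the schedd, record the cluster ID, move to SUBMITTED.
    cluster_id = schedd.submit(task)
    task_db[task.name] = {"status": "SUBMITTED",
                          "schedd": schedd.name,
                          "cluster_id": cluster_id}
    # (2) Attempt spooling; on success the task becomes ACTIVE and the
    #     input files can be deleted from the TaskWorker.
    if schedd.spool(cluster_id):
        task_db[task.name]["status"] = "ACTIVE"
        delete_input_files(task.name)
    # (3) On failure the task simply stays SUBMITTED; the next
    #     TaskWorker iteration re-checks it in recover_submitted().

def recover_submitted(task_db, locate_schedd):
    # Next iteration: re-check every task left in SUBMITTED state.
    # We recorded both the schedd name and the cluster ID, so we can
    # ask the right schedd where the job stands.
    for name, rec in task_db.items():
        if rec["status"] != "SUBMITTED":
            continue
        schedd = locate_schedd(rec["schedd"])
        job = schedd.query(rec["cluster_id"])
        if job.released:
            rec["status"] = "ACTIVE"
            delete_input_files(name)
        elif job.held_for_spooling:
            schedd.spool(rec["cluster_id"])   # retry step (2)
```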

I think it is simpler to do the same from the other end: since the dag_bootstrap already has a step
where it communicates with the DB to upload the web directory, and fails if it can't, this could be
extended to check that the task is indeed in SUBMITTED status (i.e. the TaskWorker has finished
working on it) and that the schedd/cluster-id recorded in the DB match the current job:


On 04/09/2020 00:35, Stefano Belforte wrote:
> Of course condor_rm may always fail, esp. if schedd is
> busy (or spool would have worked to begin with).. but let's see.
> Maybe the dag_bootstrap can check that it is an acknowledged child,
> e.g. when it uploads the WEBDIR location (a useless action
> per se, as we know very well where the webdir is).
> 
> The more I think about this, the more I like that if I am the dagbootstrap,
> I want to be sure that this task is registered in the DB with status SUBMITTED
> and with the same clusterId and schedd name
> which I have, before I unleash my DAGMAN. 
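
A minimal sketch of that self-check, again with hypothetical names: `task_rec` is assumed to be the row the dag_bootstrap already fetches from the Task DB during its webdir-upload step, and `SCHEDD_NAME`/`CONDOR_ID` stand in for however the bootstrap actually learns which schedd and cluster it is running as:

```python
import os
import sys

def verify_registration(task_rec):
    # task_rec: the Task DB row for this task, assumed already fetched
    # during the existing webdir-upload step. SCHEDD_NAME and CONDOR_ID
    # are hypothetical environment variables, not real CRAB ones.
    my_schedd = os.environ.get("SCHEDD_NAME")
    my_cluster = os.environ.get("CONDOR_ID")
    if (task_rec["status"] != "SUBMITTED"
            or task_rec["schedd"] != my_schedd
            or str(task_rec["cluster_id"]) != str(my_cluster)):
        # We are not the acknowledged child of the TaskWorker: abort
        # before DAGMan starts, rather than run a duplicate of the task.
        sys.exit("dag_bootstrap: task is not registered in the DB for "
                 "this schedd/cluster, aborting before DAGMan starts")
```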