dag_bootstrap should abort if not properly registered in Task DB #6151

Closed
belforte opened this issue Sep 8, 2020 · 0 comments
belforte commented Sep 8, 2020

Again on the topic of tasks submitted multiple times, following up on #6145:
we still do not have a totally fool-proof protocol. In particular the schedd may become
unresponsive right after a dag_bootstrap job has been released, so the TaskWorker
sees an error but is unable to remove the job, and goes on in its retry loop, eventually
submitting the same task to another schedd.

Rare, but possible, and by Murphy's law, we should protect against it.

One way is a more complex task state machine where the task is put in a different state until
proper submission can be confirmed, as suggested by Brian:

On 03/09/2020 20:26, Brian Paul Bockelman wrote:

> From the CRAB side, we probably ought to have a separate state here -
> let's call it "SUBMITTED" and "ACTIVE".
>
> 1. Job is submitted to schedd, gets a cluster ID back. Cluster ID is stored
>    in the task's description and task moves to the state SUBMITTED.
> 2. Job spooling is attempted. If successful, job is moved into "ACTIVE" state
>    and the input files are deleted from the TaskWorker.
> 3. If spooling is unsuccessful, job stays in the "SUBMITTED" state. On the
>    next iteration, the TaskWorker queries the status of all tasks in
>    SUBMITTED state (you can do that as you recorded the Cluster ID). If the
>    job is released, then mark the task as ACTIVE and delete the input files
>    from the TW. If the job is still held due to spooling, go to step (2).
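
For concreteness, here is a minimal Python sketch of the SUBMITTED/ACTIVE flow Brian describes. All names here (`schedd`, `task_db`, `locate_schedd`, `delete_input_files`) are hypothetical stand-ins for the real HTCondor bindings, the Task DB client, and the TaskWorker cleanup code, not actual CRAB APIs:

```python
# Sketch only: schedd, task_db, locate_schedd and delete_input_files
# are hypothetical placeholders for the real TaskWorker machinery.

def submit_task(task, schedd, task_db):
    # (1) Submit to the schedd, record the cluster ID, move to SUBMITTED.
    cluster_id = schedd.submit(task)
    task_db[task.name] = {"status": "SUBMITTED",
                          "schedd": schedd.name,
                          "cluster_id": cluster_id}
    # (2) Attempt spooling; on success the task becomes ACTIVE and the
    #     input files can be deleted from the TaskWorker.
    if schedd.spool(cluster_id):
        task_db[task.name]["status"] = "ACTIVE"
        delete_input_files(task.name)
    # (3) On failure the task simply stays SUBMITTED; the next
    #     TaskWorker iteration re-checks it in recover_submitted().

def recover_submitted(task_db, locate_schedd):
    # Next iteration: re-check every task left in SUBMITTED state.
    # We recorded both the schedd name and the cluster ID, so we can
    # ask the right schedd where the job stands.
    for name, rec in task_db.items():
        if rec["status"] != "SUBMITTED":
            continue
        schedd = locate_schedd(rec["schedd"])
        job = schedd.query(rec["cluster_id"])
        if job.released:
            rec["status"] = "ACTIVE"
            delete_input_files(name)
        elif job.held_for_spooling:
            schedd.spool(rec["cluster_id"])   # retry step (2)
```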

I think it is simpler to do the same from the other end: since the dag_bootstrap already has a step
where it communicates with the DB to upload the web directory, and fails if it can't, this could be
extended to check that the task is indeed in SUBMITTED status (i.e. the TaskWorker has finished
working on it) and that the schedd/cluster-id recorded in the DB match the current job:


On 04/09/2020 00:35, Stefano Belforte wrote:
> Of course condor_rm may always fail, esp. if schedd is
> busy (or spool would have worked to begin with).. but let's see.
> Maybe the dag_bootstrap can check that it is an acknowledged child,
> e.g. when it uploads the WEBDIR location (a useless action
> per se, as we know very well where the webdir is).
> 
> The more I think about this, the more I like that if I am the dagbootstrap,
> I want to be sure that this task is registered in the DB with status SUBMITTED
> and with the same clusterId and schedd name
> which I have, before I unleash my DAGMAN. 
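
A minimal sketch of that self-check, again with hypothetical names: `task_rec` is assumed to be the row the dag_bootstrap already fetches from the Task DB during its webdir-upload step, and `SCHEDD_NAME`/`CONDOR_ID` stand in for however the bootstrap actually learns which schedd and cluster it is running as:

```python
import os
import sys

def verify_registration(task_rec):
    # task_rec: the Task DB row for this task, assumed already fetched
    # during the existing webdir-upload step. SCHEDD_NAME and CONDOR_ID
    # are hypothetical environment variables, not real CRAB ones.
    my_schedd = os.environ.get("SCHEDD_NAME")
    my_cluster = os.environ.get("CONDOR_ID")
    if (task_rec["status"] != "SUBMITTED"
            or task_rec["schedd"] != my_schedd
            or str(task_rec["cluster_id"]) != str(my_cluster)):
        # We are not the acknowledged child of the TaskWorker: abort
        # before DAGMan starts, rather than run a duplicate of the task.
        sys.exit("dag_bootstrap: task is not registered in the DB for "
                 "this schedd/cluster, aborting before DAGMan starts")
```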