Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job status transition in ReqClient vs JobStatus #7799

Open
iueda opened this issue Sep 19, 2024 · 6 comments
Open

job status transition in ReqClient vs JobStatus #7799

iueda opened this issue Sep 19, 2024 · 6 comments
Labels
Milestone

Comments

@iueda
Copy link
Contributor

iueda commented Sep 19, 2024

This code suggests transition of job status from 'Completed' to 'Killed'
https://github.com/DIRACGrid/DIRAC/blob/v8.0.51/src/DIRAC/RequestManagementSystem/Client/ReqClient.py#L371-L373

            if jobStatus == JobStatus.COMPLETED:
                ...
                elif jobMinorStatus == JobMinorStatus.MARKED_FOR_TERMINATION:
                    # If the job has been Killed, set it Killed
                    newJobStatus = JobStatus.KILLED
            ...
            stateUpdate = stateServer.setJobStatus(jobID, newJobStatus, JobMinorStatus.REQUESTS_DONE, "RMS")

but there is no such state transition defined in
https://github.com/DIRACGrid/DIRAC/blob/v8.0.51/src/DIRAC/WorkloadManagementSystem/Client/JobStatus.py#L88

            COMPLETED: State(11, [DONE, FAILED], defState=COMPLETED),

Which is the policy?

@fstagni
Copy link
Contributor

fstagni commented Sep 19, 2024

Just few days ago we created #7794 but we discussed that it would not make sense to set jobs from "COMPLETED" to "KILLED". The code above was introduced in #5650 and I do not find its operational source. I can see in LHCb that we have atm 2 jobs "DONE" with minorStatus "Marked for termination". I need some arguments on how to continue.

@fstagni fstagni added the WMS label Sep 19, 2024
@fstagni fstagni added this to the v8.0 milestone Sep 19, 2024
@iueda
Copy link
Contributor Author

iueda commented Sep 19, 2024

We have jobs in Status='Completed' and MinorStatus='Marked for termination'

Looking into the job LoggingInfo

JobWrapper           Completed   Pending Requests                   ... 
RMS(SM)              Completed   Requests done                      ...

Supposedly, the user killed the job while it was in RMS, with Status='Completed'.
The job could not be killed because the state transition is not allowed, and RMS could not
set it to 'Killed' for the same reason, I imagine.

@atsareg
Copy link
Contributor

atsareg commented Sep 19, 2024

I think In this case the final job state should be Done or Failed as in the checks just above. So, the lines https://github.com/DIRACGrid/DIRAC/blob/v8.0.51/src/DIRAC/RequestManagementSystem/Client/ReqClient.py#L371-L373 can be suppressed. If we have this case, it means that the job kill command was given too late to influence the job execution, the job was already executed, so the kill command should be just ignored. This can be coded in the JobManager.__killJob(): set the status to Killed and only if it is successful - set the minor status to "Marked for termination" as a separate call

@iueda
Copy link
Contributor Author

iueda commented Sep 20, 2024

I think In this case the final job state should be Done or Failed ... it means that the job kill command was given too late to influence the job execution, the job was already executed, so the kill command should be just ignored

That may be right from the system point of view. But requests may take time to be processed,
then the jobs stay in the non-terminal status even when the users would like to get rid of them.

This can be coded in the JobManager.__killJob(): set the status to Killed and only if it is successful - set the minor status to "Marked for termination" as a separate call

I thought "Marked for termination" is to be set to MinorStatus when the Status cannot be immediately set to "Killed", so that the status can go to "Killed" when there is no more bloker.

@atsareg
Copy link
Contributor

atsareg commented Dec 12, 2024

I would suggest to just set the job status to Failed rather than Killed assuming that if the user Killed the job something wrong happened with it based on the user's judgement

@iueda
Copy link
Contributor Author

iueda commented Dec 15, 2024

Are you suggesting that the code is to be changed from

                    newJobStatus = JobStatus.KILLED

to

                    newJobStatus = JobStatus.FAILED

?

I would suggest to just set the job status to Failed rather than Killed assuming that if the user Killed the job something wrong happened with it based on the user's judgement

Not necessarily... It may simply be that they couldn't wait until everything reaches Done/Failed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants