
verdi processes play --all does not work for process in Transport task update was cancelled status #6569

Open
superstar54 opened this issue Sep 23, 2024 · 4 comments

Comments

@superstar54
Member

superstar54 commented Sep 23, 2024

Here are my processes

13709  2D ago     Cp2kCalculation         ⏵ Waiting        Transport task update was cancelled
13713  2D ago     Cp2kCalculation         ⏵ Waiting        Transport task update was cancelled
13714  2D ago     Cp2kCalculation         ⏵ Waiting        Transport task update was cancelled
13715  2D ago     Cp2kCalculation         ⏵ Waiting        Transport task update was cancelled
13716  2D ago     Cp2kCalculation         ⏵ Waiting        Monitoring scheduler: job state RUNNING
13717  2D ago     Cp2kCalculation         ⏵ Waiting        Transport task update was cancelled
13719  2D ago     Cp2kCalculation         ⏵ Waiting        Transport task update was cancelled
13720  2D ago     Cp2kCalculation         ⏵ Waiting        Transport task update was cancelled
13721  2D ago     Cp2kCalculation         ⏵ Waiting        Transport task update was cancelled

I want to play all of them, but verdi process play --all does not work:

> verdi process play --all
Report: no active processes selected.

I have to run verdi process play 13709 13721, passing the PKs explicitly.

@unkcpz
Member

unkcpz commented Oct 1, 2024

Maybe you had just restarted your daemon and did not wait long enough before running the command? If you wait long enough, I guess those processes will change to the running state?
If this can be reproduced, it would be interesting to check verdi daemon logshow to see whether the workers are resuming the processes.

@superstar54
Member Author

Hi @unkcpz , thanks for your suggestion!

If you wait long enough, I guess those processes will change to the running state?

What's the reason behind this?

@unkcpz
Member

unkcpz commented Oct 16, 2024

What I suspected in your case was that you had just restarted the daemon (and therefore the workers), so all the processes needed to be reloaded into the event loop.

If I understand correctly what happens when a worker starts: it creates a runner and calls _continue on each process ID, which recreates the process instance from its checkpoint via load_checkpoint. This is a function in plumpy, but it is called by the ProcessLauncher, which is subscribed to the worker (runner) when it starts.

This procedure is not cheap, since it requires communication with the database. If you have many processes to load, it can take minutes before the workers respond to actions on the processes as you expect.

My guess is that you could then run verdi process play <pk> because enough time had passed and the workers were able to respond. Maybe I am totally wrong.

Another guess is that verdi process play --all uses the broadcast channel while verdi process play <pk> uses the RPC channel, and that the RMQ side has a problem with broadcasts but works fine with RPC. I don't know how difficult it is to check this hypothesis with RMQ.
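To make the distinction behind this second guess concrete, here is a minimal in-memory sketch of the two delivery modes. ToyCommunicator and its subscriber methods are hypothetical; only the method names rpc_send and broadcast_send are borrowed from kiwipy, and nothing here talks to RabbitMQ. It assumes, purely for illustration, that the process is reachable by RPC but has no working broadcast subscription.

```python
class ToyCommunicator:
    """Hypothetical in-memory stand-in for a message communicator."""

    def __init__(self):
        self._rpc_subscribers = {}        # recipient id -> callback
        self._broadcast_subscribers = []  # callbacks that hear every broadcast

    def add_rpc_subscriber(self, identifier, callback):
        self._rpc_subscribers[identifier] = callback

    def add_broadcast_subscriber(self, callback):
        self._broadcast_subscribers.append(callback)

    def rpc_send(self, recipient_id, msg):
        # RPC targets exactly one recipient and hands back its reply.
        return self._rpc_subscribers[recipient_id](msg)

    def broadcast_send(self, msg):
        # A broadcast is fire-and-forget: no reply, and no error if nobody listens.
        for callback in self._broadcast_subscribers:
            callback(msg)


played = []


def on_message(msg):
    # Callback of the (toy) process with pk 13709.
    if msg == "play":
        played.append(13709)
    return "ok"


comm = ToyCommunicator()
# Assumption for this sketch: only the RPC subscription is in place.
comm.add_rpc_subscriber(13709, on_message)

comm.broadcast_send("play")        # silently reaches nobody
played_after_broadcast = list(played)

reply = comm.rpc_send(13709, "play")  # reaches the process directly
played_after_rpc = list(played)
```

In this toy setup the broadcast is lost without any error while the RPC call succeeds, which is the kind of asymmetry the hypothesis would require.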

@agoscinski
Contributor

agoscinski commented Oct 19, 2024

I could reproduce the bug locally by cancelling the running process (killing the daemon worker) and then reloading it by spawning a new daemon worker. However, this only happens when the process has been paused and played once before this procedure; otherwise the daemon worker picks up the task automatically, brings it back to the running state, and there is no need to play the process.

It seems to be because the process's state is set to paused in the checkpoint, but not in the database. So the query that retrieves all paused processes when running verdi process play --all returns an empty list, whereas when the play command is sent to the worker that reloaded the process from the checkpoint, process.paused is True.

That must be a bug in the persistence: the checkpoint is not updated when the process is played. As a result, the process is probably not picked up automatically when reloaded, because it is still paused.
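The suspected sequence can be sketched with a small in-memory model. All names here (ToyProcess, ToyWorker, play_all) are hypothetical and are not AiiDA or plumpy classes; the sketch only assumes that playing clears the paused flag in the database while leaving a stale paused flag in the checkpoint.

```python
from dataclasses import dataclass


@dataclass
class ToyProcess:
    """Hypothetical stand-in for one process; not an AiiDA class."""
    pk: int
    db_paused: bool = False          # what a database query can see
    checkpoint_paused: bool = False  # what is serialized in the checkpoint

    def pause(self):
        # Pausing is assumed to be persisted in both places.
        self.db_paused = True
        self.checkpoint_paused = True

    def play_with_stale_checkpoint(self):
        # Assumed bug: the database flag is cleared, the checkpoint is not rewritten.
        self.db_paused = False


class ToyWorker:
    """Holds the in-memory paused state of the processes it has reloaded."""

    def __init__(self):
        self.paused_in_memory = {}

    def reload(self, process):
        # The live instance is rebuilt from the checkpoint, not from the database.
        self.paused_in_memory[process.pk] = process.checkpoint_paused

    def play(self, pk):
        # Playing by pk reaches the live instance directly; returns whether it was paused.
        was_paused = self.paused_in_memory[pk]
        self.paused_in_memory[pk] = False
        return was_paused


def play_all(database, worker):
    # The "--all" path selects its targets with a database query on the paused flag.
    selected = [p.pk for p in database if p.db_paused]
    if not selected:
        return "Report: no active processes selected."
    for pk in selected:
        worker.play(pk)
    return f"played {len(selected)} processes"


process = ToyProcess(pk=13709)
process.pause()
process.play_with_stale_checkpoint()

worker = ToyWorker()    # a fresh worker spawned after the old one was killed
worker.reload(process)  # the process comes back paused from the stale checkpoint

report_all = play_all([process], worker)        # the query finds nothing to play
still_paused = worker.paused_in_memory[13709]   # yet the live instance is paused
resumed_by_pk = worker.play(13709)              # an explicit pk does resume it
```

Under these assumptions the model reproduces both observations from the report: the --all path prints the "no active processes selected" message, while playing by PK works.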

We could also make the play command more robust to such failures in general by changing the query to select all processes in non-terminated states. I guess the performance cost is negligible.
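A rough sketch of what such a widened selection could look like, written as a plain filter dictionary in QueryBuilder filter syntax. The helper play_all_filters is hypothetical, and the attribute keys process_state and paused are assumptions about how the node attributes are stored; this is not the actual implementation of the command.

```python
# Non-terminated process states; the terminated ones are finished, excepted and killed.
NON_TERMINATED_STATES = ("created", "waiting", "running")


def play_all_filters(require_paused_flag):
    """Build filters for selecting the targets of a hypothetical "play --all".

    With require_paused_flag=True only processes flagged as paused in the
    database are selected. With False every non-terminated process is
    selected, regardless of what the database says about its paused flag.
    """
    filters = {"attributes.process_state": {"in": list(NON_TERMINATED_STATES)}}
    if require_paused_flag:
        filters["attributes.paused"] = True
    return filters


strict = play_all_filters(require_paused_flag=True)
robust = play_all_filters(require_paused_flag=False)

# Assumed usage inside a loaded AiiDA profile (not executed here):
#   from aiida.orm import ProcessNode, QueryBuilder
#   processes = QueryBuilder().append(ProcessNode, filters=robust).all(flat=True)
```

The robust variant sends the play command to more processes than strictly needed, so it relies on playing an already running process being harmless.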
