`verdi processes play --all` does not work for process in `Transport task update was cancelled` status #6569

superstar54 · 2024-09-23T12:19:56Z

Here are my processes

13709  2D ago     Cp2kCalculation         ⏵ Waiting        Transport task update was cancelled
13713  2D ago     Cp2kCalculation         ⏵ Waiting        Transport task update was cancelled
13714  2D ago     Cp2kCalculation         ⏵ Waiting        Transport task update was cancelled
13715  2D ago     Cp2kCalculation         ⏵ Waiting        Transport task update was cancelled
13716  2D ago     Cp2kCalculation         ⏵ Waiting        Monitoring scheduler: job state RUNNING
13717  2D ago     Cp2kCalculation         ⏵ Waiting        Transport task update was cancelled
13719  2D ago     Cp2kCalculation         ⏵ Waiting        Transport task update was cancelled
13720  2D ago     Cp2kCalculation         ⏵ Waiting        Transport task update was cancelled
13721  2D ago     Cp2kCalculation         ⏵ Waiting        Transport task update was cancelled

I want to play all of them, but `verdi process play --all` does not work:

> verdi process play --all                                                         (science) 
Report: no active processes selected.

I have to run `verdi process play 13709 13721`, giving the pks explicitly.


unkcpz · 2024-10-01T14:45:07Z

Maybe you had just restarted your daemon and did not wait long enough before running the command? If you wait long enough, I guess those processes will change to the running state.
If this can be reproduced, it would be interesting to check `verdi daemon logshow` to see whether the workers are resuming the processes.

superstar54 · 2024-10-15T19:32:14Z

Hi @unkcpz, thanks for your suggestion!

> If you wait long enough I guess those processes will change to the running state?

What's the reason behind this?

unkcpz · 2024-10-16T14:03:26Z

What I suspected in your case is that you had just restarted the daemon (and therefore the workers), so all the processes needed to be reloaded into the event loop.

If I understand correctly what happens when a worker starts: it creates a runner and calls `_continue` with the process id, which recreates the process instance from its checkpoint via `load_checkpoint`. It is a function in plumpy, but it is called by the `ProcessLauncher`, which is subscribed to the worker (runner) when it starts.

This procedure is not cheap, since it requires communication with the database. If you have many processes to load, it can take minutes before the workers respond to actions on the processes as you expect.

My guess is that `verdi process play <pk>` worked for you because enough time had passed for the workers to respond. Maybe I am totally wrong.

Another guess is that `verdi process play --all` uses the broadcast channel while `verdi process play <pk>` uses the RPC channel, and that the RMQ side has a problem with the broadcast channel but works fine with the RPC one. I don't know how difficult it is to check this hypothesis with RMQ.

agoscinski · 2024-10-19T00:15:01Z

I could reproduce the bug locally by killing the daemon worker while the process was running and then spawning a new daemon worker to reload it. However, this only happens when the process has been paused and played once before that procedure. Otherwise the new daemon worker automatically picks up the task and brings it back to the running state, so there is no need to play the process.

It seems to be because the process's paused state is set in the checkpoint but not in the database. So the query that retrieves all paused processes when running `verdi process play --all` returns an empty list, whereas when the play command is sent directly to the worker that reloaded the process from the checkpoint, `process.paused` is `True`.

That must be a bug in the persistence: the checkpoint is not updated when the process is played. That is probably also why the process is not picked up automatically when it is reloaded: according to the checkpoint, it is still paused.
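To make the suspected mismatch concrete, here is a toy model in plain Python. It is not AiiDA code: `ToyProcess`, `play_all` and `play_by_pk` are hypothetical names, and the two flags only stand in for the database attribute and the checkpoint state described above.

```python
from dataclasses import dataclass


@dataclass
class ToyProcess:
    """Hypothetical stand-in for a process node; not an AiiDA class."""
    pk: int
    db_paused: bool          # flag stored in the database, used by the query
    checkpoint_paused: bool  # flag restored when a worker reloads the checkpoint


def play_all(processes):
    """Mimic `--all`: select by the database flag, then play what was selected."""
    selected = [p for p in processes if p.db_paused]
    for p in selected:
        p.db_paused = p.checkpoint_paused = False
    return [p.pk for p in selected]


def play_by_pk(processes, pks):
    """Mimic explicit pks: the play message reaches the reloaded process itself."""
    played = []
    for p in processes:
        if p.pk in pks and p.checkpoint_paused:
            p.db_paused = p.checkpoint_paused = False
            played.append(p.pk)
    return played


# Assumed state after the worker was killed and the processes were reloaded
# from a stale checkpoint: paused in the checkpoint, not paused in the database.
procs = [
    ToyProcess(13709, db_paused=False, checkpoint_paused=True),
    ToyProcess(13721, db_paused=False, checkpoint_paused=True),
]
```

In this model `play_all(procs)` selects nothing, which would correspond to the `no active processes selected` report, while `play_by_pk(procs, {13709, 13721})` un-pauses both processes.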

We could also make the play command more robust to this kind of failure in general by changing the query to cover all processes whose state is not terminated. I guess the performance cost is negligible.
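A rough sketch of the difference between the two selection rules, again as plain Python over dictionaries rather than an actual AiiDA query. The function names are hypothetical, pk 13700 is made up for illustration, and the state names follow the usual process states (`created`, `waiting`, `running`, `finished`, `excepted`, `killed`).

```python
ACTIVE_STATES = {"created", "waiting", "running"}
TERMINATED_STATES = {"finished", "excepted", "killed"}


def select_paused_only(records):
    """Selection as described above: active processes flagged as paused in the database."""
    return [
        r["pk"] for r in records
        if r["process_state"] in ACTIVE_STATES and r["paused"]
    ]


def select_non_terminated(records):
    """Proposed broader selection: every process that has not terminated."""
    return [r["pk"] for r in records if r["process_state"] not in TERMINATED_STATES]


records = [
    # Paused according to the checkpoint, but the database flag is stale.
    {"pk": 13709, "process_state": "waiting", "paused": False},
    # Not paused at all, just waiting on the scheduler.
    {"pk": 13716, "process_state": "waiting", "paused": False},
    # Hypothetical terminated process, must never be selected.
    {"pk": 13700, "process_state": "finished", "paused": False},
]
```

With these records the paused-only rule selects nothing, while the broader rule selects 13709 and 13716. The play message would then also go to processes that are not paused, such as 13716, which is presumably where any extra cost would come from.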

unkcpz · 2024-10-28T12:52:16Z

Hi @superstar54, can you confirm whether you still have this problem, and whether restarting the daemon and waiting a couple of minutes fixes it? I'll close this for now; feel free to reopen it if you see the problem again.

agoscinski · 2024-10-28T12:57:18Z

I discussed this with @unkcpz; I will investigate it a bit more.

superstar54 · 2024-10-28T13:00:21Z

@unkcpz, @agoscinski, thanks for looking into this issue. I am not getting `Transport task update was cancelled` in my work at the moment. I will try restarting the daemon and waiting if I run into this situation again.

superstar54 added the type/bug label Sep 23, 2024

agoscinski added topic/persistence priority/important labels Oct 19, 2024

agoscinski added the topic/engine label Oct 28, 2024

agoscinski assigned unkcpz Oct 28, 2024

unkcpz closed this as completed Oct 28, 2024

agoscinski assigned agoscinski and unassigned unkcpz Oct 28, 2024

unkcpz reopened this Oct 28, 2024
