If I recall, the output protocol doesn't even have responses. If we added responses and automatically retried on reconnect, we could end up with duplicate requests at the server end: if the original request arrived before the connection was lost but the response did not, the retry would produce duplicate lines in the output. If we could change the output protocol to be idempotent (for example, "put line by number", where the server side tracks which line number range it has already seen), then maybe we can fix that one.
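To make the "put line by number" idea concrete, here is a minimal sketch in Python. It is not the shell's actual output protocol: the `LineStore` class, its `put` method, and the 0-based line numbering are all hypothetical, and it assumes a single output stream whose lines arrive without gaps.

```python
class LineStore:
    """Hypothetical server-side store for one output stream."""

    def __init__(self):
        # Lines accepted so far; line number N lives at index N (0-based).
        self.lines = []

    def put(self, first_lineno, new_lines):
        """Store new_lines, where new_lines[0] has number first_lineno.

        Lines the store has already seen are skipped, so repeating a
        request after a reconnect does not duplicate output.
        Returns the next line number the store expects.
        """
        next_expected = len(self.lines)
        if first_lineno > next_expected:
            # Assumption: senders never skip ahead, so a gap is an error.
            raise ValueError("gap in line numbers")
        # Number of leading lines in this request that were already stored.
        already_seen = next_expected - first_lineno
        self.lines.extend(new_lines[already_seen:])
        return len(self.lines)


store = LineStore()
store.put(0, ["a", "b"])   # original request
store.put(0, ["a", "b"])   # retry after reconnect: no effect
store.put(1, ["b", "c"])   # partially overlapping request: only "c" is new
```

Because the request names the line numbers it carries, the server can apply it any number of times with the same result, which is what makes an automatic retry safe.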
If the shell supports offline/reconnect, any inter-shell RPCs that failed as a result will need to be retried. It sounds like this is only practical if all requests are idempotent.
It may be that the only service we definitely need to make idempotent is the shell output service.
If senders add a sequence number to requests to write output to rank 0, then rank 0 can easily discard duplicates.
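As a rough illustration of the sequence number approach, the sketch below shows a receiver that drops duplicates per sender. The `Rank0Output` and `Sender` classes and their methods are hypothetical names, not part of the shell, and the sketch assumes each sender issues its requests in order and finishes retrying one before sending the next.

```python
class Rank0Output:
    """Hypothetical rank 0 output receiver that discards duplicates."""

    def __init__(self):
        self.last_seq = {}   # sender rank -> highest sequence number accepted
        self.output = []     # accepted (sender rank, data) pairs, in order

    def write(self, sender, seq, data):
        """Accept data unless this sender's seq was already seen."""
        if seq <= self.last_seq.get(sender, -1):
            return False     # duplicate from a retry: discard
        self.last_seq[sender] = seq
        self.output.append((sender, data))
        return True


class Sender:
    """Hypothetical shell rank that numbers its output requests."""

    def __init__(self, rank, server):
        self.rank = rank
        self.server = server
        self.next_seq = 0

    def send(self, data, attempts=1):
        # A retry reuses the original sequence number, so rank 0 can
        # recognize it as the same request.
        seq = self.next_seq
        self.next_seq += 1
        for _ in range(attempts):
            self.server.write(self.rank, seq, data)


rank0 = Rank0Output()
sender = Sender(1, rank0)
sender.send("hello\n", attempts=2)   # simulates a retry after reconnect
sender.send("world\n")
```

Tracking only the highest accepted number per sender keeps the state on rank 0 small, at the cost of requiring in-order delivery from each sender.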
We'll have to investigate any other services that will need this treatment. It doesn't feel like this should be required for startup services like the pmi plugin, or services that are used from outside the shell like mpir or pty.
Unfortunately, this same requirement will apply to all out-of-tree plugins, which makes developing them a lot more complex.
The same problem could apply to regular services used by the shell. For example, if the shell raises an exception and receives an error indicating the connector reconnected, should it raise the exception again? I forget if the shell posts directly to the eventlog or goes through the exec service, but that's another case where an "append" is not idempotent.
No, it goes through the job manager via flux_job_raise(3). Not sure of the repercussions of having two exceptions in the eventlog. Seems like the least of our problems at this point.
Since the shell uses many services, do we have to go through every service in Flux and make them all idempotent?
The passage quoted above, beginning "If the shell supports offline/reconnect", is a comment from @garlick in #3900.