shell: make inter-shell services idempotent #3903

Open
grondo opened this issue Oct 5, 2021 · 2 comments

@grondo
Contributor

grondo commented Oct 5, 2021

A comment from @garlick in #3900:

If I recall, the output protocol doesn't even have responses. If we added responses, then if we automatically retried on reconnect, we could end up with duplicate requests at the server end if the original request made it before the connection was lost, but the response did not, which means duplicate lines in the output. If we could change the output protocol to be idempotent (like put line by number, where the server side tracks which line number range it has already seen), then maybe we can fix that one.

If the shell supports offline/reconnect, any inter-shell RPCs that fail as a result will need to be retried. It sounds like this is only practical if all requests are idempotent.

It may be that the only service we definitely need to make idempotent is the shell output service.

If senders add a sequence number to requests to write output to rank 0, then rank 0 can easily discard duplicates.
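A rough sketch of what that could look like on the receiving side, in Python for brevity. Everything here is hypothetical: `OutputSink`, `write()`, and the per-rank counter are illustration only, not the shell's actual output service. It also assumes each sender's requests arrive in order, so rank 0 only needs to remember the next expected sequence number per sender.

```python
class OutputSink:
    """Hypothetical rank-0 output collector that drops duplicate writes."""

    def __init__(self):
        self.next_seq = {}  # sender rank -> next expected sequence number
        self.lines = []     # accepted (rank, data) pairs, in arrival order

    def write(self, rank, seq, data):
        """Return True if the write was accepted, False if it was a duplicate."""
        expected = self.next_seq.get(rank, 0)
        if seq < expected:
            # Already seen: the original request arrived, only the
            # response was lost, and the sender retried after reconnect.
            return False
        if seq > expected:
            # Assumption: in-order delivery per sender, so a gap is an error.
            raise ValueError(f"rank {rank}: expected seq {expected}, got {seq}")
        self.lines.append((rank, data))
        self.next_seq[rank] = seq + 1
        return True


sink = OutputSink()
sink.write(1, 0, "hello")
sink.write(1, 1, "world")
sink.write(1, 1, "world")  # retry of the same request; discarded
sink.write(2, 0, "other")  # each sender has its own sequence space
```

With this, a sender can blindly resend any unacknowledged request after a reconnect, as long as it reuses the original sequence number.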

We'll have to investigate any other services that will need this treatment. It doesn't feel like this should be required for startup services like the pmi plugin, or services that are used from outside the shell like mpir or pty.

Unfortunately, this same requirement will apply to all out-of-tree plugins, which makes developing them a lot more complex.

@garlick
Member

garlick commented Oct 5, 2021

The same problem could apply to regular services used by the shell. For example, if the shell raises an exception and receives an error indicating the connector reconnected, should it raise the exception again? I forget if the shell posts directly to the eventlog or goes through the exec service but that's another one where an "append" is not idempotent.
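To illustrate why a plain "append" is the problem, here is a minimal sketch of one common way to make an append safe to retry: the caller picks a request ID once and reuses it on every retry, and the log ignores IDs it has already seen. This is not how the job manager or the eventlog works today; `EventLog`, `append()`, and the request ID are all made up for the example.

```python
import uuid


class EventLog:
    """Hypothetical append-only log that ignores repeated request IDs."""

    def __init__(self):
        self.entries = []  # events in append order
        self.seen = set()  # request IDs already applied

    def append(self, request_id, event):
        """Return True if the event was appended, False if it was a retry."""
        if request_id in self.seen:
            return False
        self.seen.add(request_id)
        self.entries.append(event)
        return True


log = EventLog()
req = str(uuid.uuid4())  # chosen once by the caller, reused on every retry
event = {"name": "exception", "severity": 0}
log.append(req, event)
log.append(req, event)  # retry after a reconnect error; no second entry
```

The cost is that the server has to remember request IDs for some period, which is extra state that every such service would need to carry.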

@grondo
Contributor Author

grondo commented Oct 5, 2021

No, it goes through the job manager via flux_job_raise(3). I'm not sure of the repercussions of having two exceptions in the eventlog, but that seems like the least of our problems at this point.

Since the shell uses many services, do we have to go through every service in Flux and make them all idempotent?
