-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
finish the leave process if broadcasting leave timeout #640
Merged
Merged
Changes from 2 commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth adding a comment to point out why we don't return on this "error" but continue to attempt to leave?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also do we need to do the same for the
Leave
call below since that will also error and return after the same timeout (i.e. with high probability in a large cluster).It might not be depending on what the calling code does. The benefit would be that we could be sure that we remain online for
LeavePropagationDelay
to increase chances of a clean leave. If the calling application just logs the error and exits anyway for example then it would be better to not return an error in the timeout case there too to make a best effort of allowing propagation despite not making it through the broadcast queues fully in the time allotted?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking more about this I think:
Based on this, I think the best would be to wrap all the timeout errors that we could encounter during the sequence (both for Leave broadcast and the memberList leave) and if any return it at the end of the sequence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 If the error is a broadcast timeout though and we still went through all the steps, is there anything meaningful the caller can do other than log it? It doesn't make sense to call
Leave
again since it's already left in its own serf state and is just as likely to timeout on the broadcasts a second time unless it waits for a really long time for its broadcast queues to empty which is likely not a reasonable option when callingLeave
as 99% of the time this is during a process shutdown.Overall I think you're right, I just wonder what the value of reporting an error the user they can't do anything about is? At very least we should probably update the doc comment to point out that even if a timeout error is returned, the node will still have "left" the cluster, it'd just not possible to determine in the bounded time whether all other peers will have seen that. Technically it's not possible to know that for sure even if it doesn't time out actually, but it could be even more confusing to get a timeout error for an operation that is likely to have been successful 🤷 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking more about if the serf leave fail the calling app have another mean to reach other servers and notify them about it leaving, but that's a far fetch scenario I admit.
Yes it's a bit tricky because the timeout error in this case is more like, I can't confirm that I was able to notify the other nodes about leaving but I did my best to leave, so I'm leaning toward not confusing the user so:
a Question about this, is there any concerns about changing the signature of the
Leave
func?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable at least for the timeout cases if they are distinguishable from other possible leave errors.
We could certainly make it clearer. In practice everything in Serf is "best effort" even if you don't get any timeouts at all there is still no guarantee that every node in the cluster has acknowledged the leave before this method returns because eventual consistency! I think it would be useful to call that out in leave especially though as its often confusing for folks when leaves don't happen cleanly.
👍
My feeling is we shouldn't change the func signature. There are other cases where an error is reasonable - for example broadcast could theoretically fail because of some other reason like the node is already in a left state or the network is misconfigured and errors as soon as you try to send a packet on the transport (remember transport is pluggable not always UDP necessarily). Ideally we'd still return those. And even if every possibly code path right now in practice returns only errors we are choosing to swallow, I personally don't think it makes sense to break API compatibility just to "clean up" the error return.