finish the leave process if broadcasting leave timeout #640

dhiaayachi · 2021-11-24T20:47:13Z

This fixes #638.

banks · 2022-01-05T15:10:22Z

serf/serf.go

@@ -727,7 +726,7 @@ func (s *Serf) Leave() error {
 		select {
 		case <-notifyCh:
 		case <-time.After(s.config.BroadcastTimeout):
-			return errors.New("timeout while waiting for graceful leave")
+			s.logger.Printf("[WARN] serf: timeout while waiting for graceful leave")


Is it worth adding a comment to point out why we don't return on this "error" but continue to attempt to leave?

Also, do we need to do the same for the Leave call below? It will also error and return after the same timeout (i.e. with high probability in a large cluster).

It might not be necessary, depending on what the calling code does. The benefit would be that we could be sure we remain online for LeavePropagationDelay, which increases the chances of a clean leave. If the calling application just logs the error and exits anyway, for example, then it would be better not to return an error in the timeout case there either, so that we make a best effort to allow propagation even though we did not make it fully through the broadcast queues in the time allotted.

Thinking more about this I think:

it's important to finish the leave sequence regardless of any error that could happen, so the Leave call should make a best effort to execute each of the leave steps within the given timeout

notifying the calling code about any error that happened during the leave is useful, I think. The caller could have a means to do further steps in case of error (or to skip some further steps). The caller could also implement a retry mechanism.

Based on this, I think the best approach would be to wrap all the timeout errors that we could encounter during the sequence (both for the Leave broadcast and the memberlist leave) and, if there are any, return them at the end of the sequence.

The caller could have a means to do further steps in case of error

🤔 If the error is a broadcast timeout, though, and we still went through all the steps, is there anything meaningful the caller can do other than log it? It doesn't make sense to call Leave again, since the node has already left in its own serf state and is just as likely to time out on the broadcasts a second time, unless it waits a really long time for its broadcast queues to empty. That is unlikely to be a reasonable option when calling Leave, as 99% of the time this happens during a process shutdown.

Overall I think you're right, I just wonder what the value is of reporting an error to the user that they can't do anything about. At the very least we should probably update the doc comment to point out that even if a timeout error is returned, the node will still have "left" the cluster; it's just not possible to determine in the bounded time whether all other peers will have seen that. Technically it's not possible to know that for sure even if it doesn't time out, but it could be even more confusing to get a timeout error for an operation that is likely to have been successful 🤷 .

I was thinking more about the case where the serf leave fails and the calling app has another means to reach the other servers and notify them that it is leaving, but that's a far-fetched scenario, I admit.
Yes, it's a bit tricky, because the timeout error in this case means something more like "I can't confirm that I was able to notify the other nodes about leaving, but I did my best to leave". So I'm leaning toward not confusing the user:

not return an error

document the Leave as a best effort

always finish the sequence
A question about this: are there any concerns about changing the signature of the Leave func?

not return an error

Seems reasonable at least for the timeout cases if they are distinguishable from other possible leave errors.

document the Leave as a best effort

We could certainly make it clearer. In practice everything in Serf is "best effort": even if you don't get any timeouts at all, there is still no guarantee that every node in the cluster has acknowledged the leave before this method returns, because of eventual consistency! I think it would be useful to call that out in Leave especially, though, as it's often confusing for folks when leaves don't happen cleanly.

always finish the sequence

👍

A question about this: are there any concerns about changing the signature of the Leave func?

My feeling is we shouldn't change the func signature. There are other cases where an error is reasonable. For example, broadcast could theoretically fail for some other reason, like the node already being in a left state, or the network being misconfigured so that it errors as soon as you try to send a packet on the transport (remember the transport is pluggable, not necessarily always UDP). Ideally we'd still return those. And even if every possible code path right now in practice returns only errors we are choosing to swallow, I personally don't think it makes sense to break API compatibility just to "clean up" the error return.

banks

I think this behaviour makes more sense as discussed here. I think a comment or two about why we don't return the error may be useful as a pointer, can just link to this PR if that's easiest, but up to you - that doesn't need to block this PR.

There is a CI failure right now but it seems to be unrelated - might be worth double checking though as it's possible that we called Leave() in some tear down methods and the timing assumptions have now changed.

finish the leave process if broadcasting leave timeout

616bcc4

banks reviewed Jan 5, 2022

View reviewed changes

Log broadcast timeout as WARN and finish the Leave process.

d59a77e

banks approved these changes Jan 11, 2022

View reviewed changes

increase timeout

3f94039

dhiaayachi merged commit daf7d4f into master Jan 12, 2022

dhiaayachi mentioned this pull request Jan 13, 2022

update serf to v0.9.7 hashicorp/consul#12057

Merged

banks deleted the leave-broadcast-fix branch July 14, 2022 14:50

sayap mentioned this pull request Apr 1, 2024

Graceful leave will often timeout on a large cluster though nothing is wrong hashicorp/consul#8435

Open


dhiaayachi commented Nov 24, 2021

banks Jan 5, 2022

banks Jan 5, 2022

dhiaayachi Jan 5, 2022

banks Jan 6, 2022 (edited)

dhiaayachi Jan 6, 2022

banks Jan 7, 2022

banks left a comment
