Avoid returning early on agent join failures #1473
@@ -192,7 +192,6 @@ func (c *controller) agentSetup() error {
 	if remoteAddr != "" {
 		if err := c.agentJoin(remoteAddr); err != nil {
 			logrus.Errorf("Error in agentJoin : %v", err)
-			return nil
This is not returning the error in the first place (it returns nil). I am not sure how this will fix the issue mentioned in the description. The only possible effect is that agentInitDone is not reinitialized. Is that the issue you are intending to fix?
When we return early, agentInitDone is not closed. If it is not closed, the goroutine waiting on it never proceeds to plumbing the ingress sandbox.
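The blocking behavior described in this comment can be sketched in a few lines. This is only an illustration under stated assumptions: the names agent, setup, waitForInit and ingressPlumbed are hypothetical and not taken from the project's code; only the idea of a done channel that must be closed mirrors agentInitDone.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// agent is a hypothetical stand-in for the controller; initDone plays
// the role of agentInitDone from the discussion.
type agent struct {
	initDone chan struct{}
}

// setup simulates agent setup. join stands in for the gossip join.
// With returnEarly set, a join error returns before initDone is
// closed, which is the behavior the PR removes.
func (a *agent) setup(join func() error, returnEarly bool) {
	if err := join(); err != nil {
		fmt.Println("join failed:", err)
		if returnEarly {
			return // initDone is never closed on this path
		}
	}
	close(a.initDone)
}

// waitForInit reports whether initDone was closed within timeout.
// It stands in for the goroutine that plumbs the ingress sandbox.
func waitForInit(a *agent, timeout time.Duration) bool {
	select {
	case <-a.initDone:
		return true
	case <-time.After(timeout):
		return false
	}
}

// ingressPlumbed runs setup with a join that always fails and reports
// whether the waiter was unblocked.
func ingressPlumbed(returnEarly bool) bool {
	a := &agent{initDone: make(chan struct{})}
	failingJoin := func() error { return errors.New("transient join failure") }
	go a.setup(failingJoin, returnEarly)
	return waitForInit(a, 100*time.Millisecond)
}

func main() {
	fmt.Println("early return, waiter unblocked:", ingressPlumbed(true))
	fmt.Println("no early return, waiter unblocked:", ingressPlumbed(false))
}
```

With the early return, the waiter times out even though the join failure may be transient; without it, the channel is closed and the waiter proceeds regardless of the join result.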
Got it. I just noticed that even though the networkDB retry logic takes effect, the call to nDB.sendNodeEvent is not part of that retry. I am not sure whether that is required. Just giving my observation.
It is not strictly needed as the other node will be able to work with just getting a join notification from memberlist. But I think it is still correct to make that fix. Will update the PR.
In what cases do we see gossip join failure? Since we already got the keys the gRPC session should be working.
@sanimej When we are restarting all the nodes in the cluster at the same time the daemon which joined the cluster using
When a gossip join failure happens do not return early in the call chain because a join failure is most likely transient and the retry logic built in the networkdb is going to retry and succeed. Returning early makes the initialization of ingress network/sandbox to not happen which causes a problem even after the gossip join on retry is successful.
Signed-off-by: Jana Radhakrishnan <[email protected]>
LGTM. As a side note we don't need NodeEventTypeJoin since the memberlist join is sufficient.
LGTM.