Tunnel auth clients appear to become stuck in bad state on restart #9655
Comments
This appears to be caused by a bug in gRPC. We are currently using a version of Line 99 in 622e0aa
https://github.com/grpc/grpc-go/releases/tag/v1.29.1 Running the test with the gRPC logs on, it seems that when the clusters are restarted the channel connectivity state immediately transitions into
I created a branch that updates the Additionally the gRPC logs no longer show a transition into the |
After testing |
Update grpc dependency to the latest version. Needed to fix the client side hang that prevents TwoClustersTunnel from running successfully, see #9655.
I just encountered an error related to this:
When investigating the high failure rate of the TwoClustersTunnel issue, rj discovered that this call to GetNodes appears to block nearly indefinitely. Upon further investigation, we found that this occurs when the cache is unhealthy and the call to GetNodes is forwarded to the leaf cluster's auth server. The test could be "fixed" by applying a very short (<=5s) timeout here. This solution can't work in production, since real GetNodes calls can take quite a while in very large clusters.

Our working theory is that the gRPC client is blocking on the old unhealthy tunnel connection instead of erring out and eventually receiving a new healthy tunnel connection. The dialer used by the gRPC client is here, which is probably where an investigation ought to begin. Ideally, we want the gRPC client to err out and re-dial as soon as possible after the leaf cluster is restarted.