clientv3 doesn't close the existing TCP session when a network failure happens with streams #9212
Comments
Right now, we let gRPC handle load balancing between endpoints (passing v1 …).

Also asked the gRPC team whether we need manual handling on the …

@gyuho Thank you for the info and for working on the issue.

This issue happens with etcd version 3.2.10. Up to 3.2.9, clients keep a single TCP session in the same situation.

@yudai Thanks for the reproduction. I was able to see the same issue. Was seeing lots of … Will debug further. We may have the same issue in master, since the underlying connection interface is the same.
I can confirm that enabling client-side keep-alive in v3.2 and v3.3 resolves the issue. Same code as @yudai's, but with keepalives:

```go
cli, err := clientv3.New(clientv3.Config{
	Endpoints:            []string{"localhost:2379"},
	DialTimeout:          5 * time.Second,
	DialKeepAliveTime:    2 * time.Second,
	DialKeepAliveTimeout: 2 * time.Second,
})
```

Once the keepalive times out, the http2 client gets closed by gRPC, so there is no lingering TCP connection. If client-side keep-alive is not enabled, gRPC keeps the previous connections. This only happens when etcd clientv3 uses the old gRPC balancer wrapper, which in this case asynchronously resets the transport on connection failures (…). In the master branch, we create our own sub-connections and do not have this issue even without keep-alive.
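Below is a minimal sketch of how the effect of these keep-alive settings could be observed from the client side, assuming a single local endpoint at localhost:2379 and the pre-3.4 import path github.com/coreos/etcd/clientv3; it polls the state of the underlying gRPC connection via clientv3's ActiveConnection() and gRPC's connectivity package.

```go
package main

import (
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
	"google.golang.org/grpc/connectivity"
)

func main() {
	// Same keep-alive settings as in the comment above.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:            []string{"localhost:2379"},
		DialTimeout:          5 * time.Second,
		DialKeepAliveTime:    2 * time.Second,
		DialKeepAliveTimeout: 2 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Poll the state of the underlying gRPC connection. With keep-alive
	// enabled, a broken connection should be reported as TRANSIENT_FAILURE
	// once the keep-alive ping times out, rather than lingering as a second
	// ESTABLISHED TCP session.
	conn := cli.ActiveConnection()
	for i := 0; i < 30; i++ {
		state := conn.GetState()
		log.Printf("connection state: %s", state)
		if state == connectivity.TransientFailure {
			log.Println("keep-alive noticed the broken connection")
		}
		time.Sleep(time.Second)
	}
}
```

Whether GetState() is available depends on the gRPC version vendored with the client, so treat this only as an illustration of the behavior described above.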
@yudai Can you try with the etcd client keep-alive parameters?

@gyuho Confirmed that the stale session is closed when I set … Closing as resolved!

@yudai Thanks for the report!
What I observed
I found that some of my client processes had two TCP sessions to their etcd endpoints (both of them were in ESTABLISHED state).
I think it's a sort of unexpected state, since each instance of clientv3's Client is, probably, supposed to keep only one TCP connection at a time. If we allow clients to keep two TCP sessions each, the cluster can consume more resources than expected.

Reproduction
(with the master branch at 64713e5927)

With some debugging, I discovered that a client starts holding two TCP sessions when (the pattern is sketched in code after the gist links below):

- the client keeps an ongoing stream request (e.g. Watch())
- the client issues a one-shot request (e.g. Put()) with some context timeout in another goroutine

It seems the balancer starts to establish a new TCP session (and fails) when a one-shot request fails due to a connectivity issue (you get context deadline exceeded). After the connectivity has recovered, the balancer succeeds in establishing the new TCP session it wanted. However, it doesn't close the previous TCP session at this point when there is an ongoing stream request, and it won't as long as the stream request is alive (usually, it's alive until the process itself dies). Eventually, both the new and the previous (existing) TCP sessions persist until the process dies.

I created a script to reproduce the issue by emulating a network trouble with iptables: https://gist.github.com/yudai/a5e1269ef3d484217b10415af6dd08c4

You would see output like the one below, with two ESTABLISHED TCP sessions at the end:
https://gist.github.com/yudai/a5e1269ef3d484217b10415af6dd08c4#gistcomment-2330187
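For reference, a minimal sketch of the client-side pattern the reproduction relies on (mine, not @yudai's gist), assuming a local endpoint at localhost:2379 and the pre-3.4 import path github.com/coreos/etcd/clientv3: one goroutine holds a Watch() stream open while the main goroutine issues Put() calls with a short context timeout; the network failure itself has to be injected externally, e.g. with the iptables rules from the gist.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Deliberately no keep-alive options: this is the configuration under
	// which the two lingering ESTABLISHED TCP sessions were observed.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Long-lived stream request: keep a Watch open for the whole process.
	go func() {
		for resp := range cli.Watch(context.Background(), "foo") {
			log.Printf("watch: %d events", len(resp.Events))
		}
	}()

	// One-shot requests with a short context timeout in another goroutine
	// (here, the main one). While traffic to the endpoint is blocked, these
	// fail with "context deadline exceeded" and make the balancer start a
	// new connection without closing the old one.
	for {
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		_, err := cli.Put(ctx, "foo", "bar")
		cancel()
		if err != nil {
			log.Printf("put failed: %v", err)
		}
		time.Sleep(time.Second)
	}
}
```

Running this, blocking and then restoring traffic to port 2379, and then inspecting the sockets (e.g. `ss -tnp | grep 2379`) should reproduce the two ESTABLISHED sessions shown in the gist output linked above.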
Is this a bug or expected behavior?
If each client keeps two TCP sessions, endpoint processes need to maintain twice as many of some resources, such as file descriptors and memory. I therefore think it's not a good idea to let clients have two TCP sessions in any case, even after network troubles.
Given that the stream APIs have their own internal retry mechanism and users are not responsible for taking care of endpoint failures, I think the balancer should do something to close the previous TCP session even when there are ongoing stream requests.