-
Notifications
You must be signed in to change notification settings - Fork 151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disable gRPC keepalive dial option on default #238
Comments
/assign @pohly |
Thanks for the detailed analysis, that helped a lot. The keepalive feature was enabled for testing over TCP because we have some tests in a custom testsuite in the PMEM-CSI driver where we intentionally reboot the target system and gRPC did not detect that without keepalive. As the comment says ( Lines 44 to 46 in d305288
|
/assign |
Hmm, but then it still fails for TCP when the driver is slow. |
I did not test it for TCP. Do you reproduce it for TCP? |
I wrote a small program to test this issue and reproduce it for TCP just now.
|
Yes. I had to enhance the mock driver first such that it also allows TCP, but then your reproducer fails also in that mode. Re-reading https://godoc.org/google.golang.org/grpc/keepalive#ClientParameters I found a comment (again) about matching it with the server options. I think when I read that the first time, I interpreted it as "timeout values must match" and assumed that the defaults should be fine on the client, as we don't change those on the server. My second reading made me wonder whether it means that keepalive must be enabled explicitly on the server. But I just tried that and it doesn't fix the issue. At this point I am tempted to just turn the canonical client/server example from gRPC into a reproducer and report this upstream.... done: grpc/grpc-go#3171 As a hot-fix we could make keepalive optional, but I'd prefer to fully understand the problem first. |
Thanks for reporting this upstream. I agree with you that fully understand the problem. |
I can do that after #233 got merged. Otherwise we'll just have merge conflicts. |
The gRPC defaults for a client do not work for a gRPC server that also uses the defaults (grpc/grpc-go#3171) which caused connection drops when a CSI driver is slow to respond and the client starts sending keepalive pings too quickly (kubernetes-csi#238). The sanity package no longer uses gRPC keepalive. Instead, custom test suites can enable that via some new configuration options if needed and when it is certain that the CSI driver under test can handle it.
The gRPC defaults for a client do not work for a gRPC server that also uses the defaults (grpc/grpc-go#3171) which caused connection drops when a CSI driver is slow to respond and the client starts sending keepalive pings too quickly (kubernetes-csi#238). The sanity package no longer uses gRPC keepalive. Instead, custom test suites can enable that via some new configuration options if needed and when it is certain that the CSI driver under test can handle it.
The gRPC defaults for a client do not work for a gRPC server that also uses the defaults (grpc/grpc-go#3171) which caused connection drops when a CSI driver is slow to respond and the client starts sending keepalive pings too quickly (kubernetes-csi#238). The sanity package no longer uses gRPC keepalive. Instead, custom test suites can enable that via some new configuration options if needed and when it is certain that the CSI driver under test can handle it.
What happened:
What you expected to happen:
For the first problem, it is should not report transport is closing.
For the second problem, the transport closing problem should not affect subsequent test cases.
How to reproduce it (as minimally and precisely as possible):
Add
time.Sleep(100*time.Second)
in ControllerDeleteSnapshotcsi-test/mock/service/controller.go
Line 504 in d305288
For save time, we could use
FIt
to run focused testcase.csi-test/pkg/sanity/controller.go
Line 1909 in d305288
Run
./hack/e2e.sh
Output:
Anything else we need to know?:
After my experiment, I believe it caused by two reasons.
My CSI driver DeleteSnapshot RPC call taking too much time about 100 seconds in some particular circumstance. But, in my opinion, it is reasonable in the automatic test. For example, storage system must ensure leasing info ready before deleting snapshot.
gRPC keepalive mechanism.
In my opinion,
For the first problem, I tried to remove keepalive dial option and the transport is not closing anymore.
csi-test/utils/grpcutil.go
Lines 47 to 48 in d305288
So, we could add a flag to add keepalive to dial options and disable it on default.
For the second problem, we can update gRPC dependency to the latest released version.
grpc/grpc-go#2669
Environment:
cat /etc/os-release
): Ubuntu 16.04The text was updated successfully, but these errors were encountered: