
gRPC streaming keepAlive ping never fails when proxied through Envoy #2086

Closed
cdelguercio opened this issue Nov 20, 2017 · 15 comments
Labels
question: Questions that are neither investigations, bugs, nor enhancements

Comments

@cdelguercio

cdelguercio commented Nov 20, 2017

Title: gRPC streaming keepAlive ping never fails when proxied through Envoy

Envoy: envoyproxy/envoy-alpine:cd514cc3f1ad82bfd57b6b832b379eb9a2888891
gRPC: grpc-go 1.7.2

Description:
I have a Docker setup where Envoy and a gRPC service run in a single container. Envoy proxies port 80 to port 8000, where the service is listening. The gRPC service has a server->client unidirectional streaming endpoint with keepAlive enabled, so that a client that disconnects ungracefully won't leave a hanging connection. When I connect to my service directly and Ctrl-Z my test client, the server notices within ~30 seconds that a keepAlive HTTP/2 PING has failed, so it closes the connection. When I connect to my service through Envoy and Ctrl-Z my test client, the connection hangs forever.

I test this locally by running my Docker container and then pointing my gRPC test client at port 8000 from my local machine to bypass Envoy. Wireshark on the docker0 interface shows the following:

[Wireshark capture, port8000_cropped: traffic on docker0 when connecting directly on port 8000]

At the end, there are 3 groups of 3 TCP frames at 55, 85, and 115 seconds on port 8000. These are obviously the keepAlive HTTP/2 PINGs.

Here is what happens when I go through Envoy on port 80:

[Wireshark capture, port80_cropped: traffic on docker0 when connecting through Envoy on port 80]

Here I see actual HTTP/2 traffic, but only for the initial connection. No matter how long I listen, I never see any keepAlive frames. I assume my service is still sending the keepAlive PINGs to Envoy over the docker container's loopback interface, but I don't know an easy way to capture that.

gRPC KeepAlive Go config:

import (
	"math"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// grpc-go uses time.Duration(math.MaxInt64) to mean "never".
var infinity = time.Duration(math.MaxInt64)

keepAliveOpt := grpc.KeepaliveParams(keepalive.ServerParameters{
	MaxConnectionIdle:     infinity,
	MaxConnectionAge:      infinity,
	MaxConnectionAgeGrace: infinity,
	Time:    25 * time.Second, // ping the client after 25s of inactivity
	Timeout: 5 * time.Second,  // drop the connection if the ping isn't acked within 5s
})

keepAliveEnforcementPolicyOpt := grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
	MinTime:             5 * time.Minute,
	PermitWithoutStream: false,
})
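
For context (not part of the original report), a minimal sketch of how these options would be wired into the server that Envoy proxies to on port 8000; the service registration is hypothetical, and "net" and "log" are also needed:

lis, err := net.Listen("tcp", ":8000")
if err != nil {
	log.Fatalf("listen: %v", err)
}
srv := grpc.NewServer(keepAliveOpt, keepAliveEnforcementPolicyOpt)
// pb.RegisterMyServiceServer(srv, &myService{}) // hypothetical registration
log.Fatal(srv.Serve(lis))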

Envoy config:

Notice that I have a separate route for my streaming endpoint, because I needed to set timeout_ms to 0 on it:

{
  "listeners": [
    {
      "address": "tcp://0.0.0.0:80",
      "filters": [
        {
          "type": "read",
          "name": "http_connection_manager",
          "config": {
            "codec_type": "auto",
            "stat_prefix": "ingress_http",
            "route_config": {
              "virtual_hosts": [
                {
                  "name": "local_service",
                  "domains": ["*"],
                  "routes": [
                    {
                      "timeout_ms": 0,
                      "prefix": "/gprc.prefix.to.my.streaming/Endpoint",
                      "headers": [
                        {"name": "content-type", "value": "application/grpc"}
                      ],
                      "cluster": "local_service_grpc",
                      "retry_policy": {
                        "retry_on": "5xx",
                        "num_retries": 3
                      }
                    },
                    {
                      "timeout_ms": 10000,
                      "prefix": "/",
                      "headers": [
                        {"name": "content-type", "value": "application/grpc"}
                      ],
                      "cluster": "local_service_grpc",
                      "retry_policy": {
                        "retry_on": "5xx",
                        "num_retries": 3
                      }
                    },
                    {
                      "timeout_ms": 10000,
                      "prefix": "/",
                      "cluster": "local_service_http"
                    }
                  ]
                }
              ]
            },
            "filters": [
              {
                "type": "decoder",
                "name": "router",
                "config": {}
              },
              {
                "type": "both",
                "name": "health_check",
                "config": {
                  "pass_through_mode": true,
                  "endpoint": "/healthcheck"
                }
              }
            ]
          }
        }
      ]
    }
  ],
  "admin": {
    "access_log_path": "/dev/null",
    "address": "tcp://0.0.0.0:8001"
  },
  "cluster_manager": {
     "clusters": [
      {
        "name": "local_service_grpc",
        "connect_timeout_ms": 10000,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "features": "http2",
        "hosts": [
          {
            "url": "tcp://127.0.0.1:8000"
          }
        ]
      },
      {
        "name": "local_service_http",
        "connect_timeout_ms": 10000,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [
          {
            "url": "tcp://127.0.0.1:8000"
          }
        ]
      }
    ]
  }
}
@mattklein123 added the question label Nov 21, 2017
@mattklein123
Member

I'm not exactly sure what "keep alive" means in your setup, but if it means proxying PING frames, Envoy does not do that currently.

@mattklein123
Member

FYI we are going to add streaming timeouts (basically timeout if no data frames are received in X seconds in either upstream/downstream direction). This would be a potential fix to ^.
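
For reference, the streaming timeout mentioned here later shipped as the stream idle timeout; in the v2 API it is the stream_idle_timeout field on the HTTP connection manager (it does not exist in the v1 config format used in this issue). A minimal sketch of the relevant fragment, in the same JSON style as the config above:

{
  "name": "envoy.http_connection_manager",
  "config": {
    "stat_prefix": "ingress_http",
    "stream_idle_timeout": "300s"
  }
}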

@htuch
Member

htuch commented Nov 21, 2017

PING frames are only hop-by-hop in HTTP/2 and per-connection, and I don't know if there are any sensible proxying semantics if you are splitting a single client stream across multiple upstream hosts for example. I think Envoy should respond to PING, but not forward.

@mattklein123
Member

PING frames are only hop-by-hop in HTTP/2 and per-connection, and I don't know if there are any sensible proxying semantics if you are splitting a single client stream across multiple upstream hosts for example. I think Envoy should respond to PING, but not forward.

I agree. If asked to actually proxy PING, I was going to say no. :)

@cdelguercio
Author

Sorry, I assumed you would know more about gRPC keep alive, since Envoy normally has such great support for gRPC. In this case it is an option for gRPC streams where you set a time interval (called Time in the gRPC-Go library's KeepaliveParams config struct), and gRPC will then ping the client at that interval to make sure it hasn't disconnected ungracefully. A client that disconnects gracefully sends some sort of disconnect message to the server. A client that disconnects ungracefully sends nothing, so if this keep alive ping isn't turned on, every ungracefully disconnected client leaves a corresponding hanging connection on the server.

It makes sense that Envoy responds to PING frames. Because HTTP/2 PINGs seem so different from ICMP pings, I was hoping the normal ping rules didn't apply. It's unfortunate that I can't use the keep alive feature of gRPC. I guess I will have to build my own application level heartbeat.
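
For illustration, a minimal sketch of such an application-level heartbeat on the server-streaming side (the service, message, and field names are hypothetical, not from this issue). Periodically sending a small message forces a write on the stream, so an ungracefully disconnected client eventually surfaces as a Send error instead of a silently hanging connection:

// Hypothetical server-streaming handler with an application-level heartbeat.
func (s *server) StreamEvents(req *pb.StreamRequest, stream pb.MyService_StreamEventsServer) error {
	ticker := time.NewTicker(25 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stream.Context().Done():
			// The stream was reset or the client disconnected cleanly.
			return stream.Context().Err()
		case <-ticker.C:
			// The write will eventually fail if the client is gone.
			if err := stream.Send(&pb.Event{Heartbeat: true}); err != nil {
				return err
			}
		}
	}
}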

@cdelguercio
Author

Though the problem here is that my service thinks that the request is coming from Envoy, so the PING goes to Envoy, but it is pretty clear that it is intended for the client. Am I getting that wrong?

@mattklein123
Member

@cdelguercio the issue is that the keep-alive mode expects an absence of ping response to mean the other side is gone, and Envoy will always directly respond to ping. In general Envoy would reset the stream, but if there is no FIN and it never tries to write, that won't happen either.

I'm pretty sure that #1778 will fix your issue (which I now see you also originally opened!). If that was in place, you could set data layer timeouts, at which point Envoy would shutdown the stream/connection.

@cdelguercio
Author

Right, and #1778 would fix it as long as the solution includes a heartbeat of some sort, since in my case the connection can (correctly) be open for an indefinite period of time without sending data and still be valid.

@cdelguercio
Author

I saw that the "http: adding 100-Continue support to Envoy" PR (#2497) got merged. Would it make sense to allow Envoy to be configured to proxy PING frames as a non-default option?

@mattklein123
Member

@cdelguercio per the previous discussion, I'm not really sure what it means to proxy ping. Ping is per connection, not per stream. What would the semantics be?

@mpuncel
Contributor

mpuncel commented Jun 6, 2018

Would it be reasonable for Envoy to reset all of the "upstream streams" corresponding to a dead downstream TCP connection, and vice versa?

@mattklein123
Member

@mpuncel Envoy already does this.

@jrajahalme
Contributor

@mattklein123 Using this old closed issue for context: is there a way to use Envoy stream idle timeouts on a route that also handles gRPC streaming connections that may be idle for "a long time" (longer than the stream idle timeout)? What I'm looking for (maybe) is a way to keep a configured stream idle timeout from breaking gRPC streaming connections before the grpc-timeout in effect fires.

@mattklein123
Member

@jrajahalme I don't think so currently. Please open a fresh issue for discussion.

@jrajahalme
Contributor

Opened #5142
