
gRPC streaming keepAlive ping never fails when proxied through Envoy #2086

Closed
cdelguercio opened this issue Nov 20, 2017 · 15 comments
Labels
question: Questions that are neither investigations, bugs, nor enhancements

Comments

@cdelguercio

cdelguercio commented Nov 20, 2017

Title: gRPC streaming keepAlive ping never fails when proxied through Envoy

Envoy: envoyproxy/envoy-alpine:cd514cc3f1ad82bfd57b6b832b379eb9a2888891
gRPC: grpc-go 1.7.2

Description:
I have a Docker setup where Envoy and a gRPC service run in a single container. Envoy proxies port 80 to port 8000, where the service is listening. The gRPC service has a server->client unidirectional streaming endpoint with keepAlive enabled, so that a client that disconnects ungracefully won't leave a hanging connection. When I connect to my service directly and Ctrl-Z my test client, the server notices within ~30 seconds that a keepAlive HTTP/2 PING has failed, so it closes the connection. When I connect to my service through Envoy and Ctrl-Z my test client, the connection hangs forever.

I test this locally by running my Docker container and then pointing my gRPC test client at port 8000 from my local machine to bypass Envoy. Wireshark on the docker0 interface shows the following:

[Wireshark capture, port8000_cropped: traffic on docker0 when connecting directly on port 8000]

At the end, there are 3 groups of 3 TCP frames at 55, 85, and 115 seconds on port 8000. These are obviously the keepAlive HTTP/2 PINGs.

Here is what happens when I go through Envoy on port 80:

[Wireshark capture, port80_cropped: traffic on docker0 when connecting through Envoy on port 80]

Here I see actual HTTP/2 traffic, but only for the initial connection. No matter how long I listen, I never see any keepAlive frames. I assume my service is still sending the keepAlive PINGs to Envoy over the docker container's loopback interface, but I don't know an easy way to capture that.

gRPC KeepAlive Go config:

import (
	"math"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// grpc-go uses time.Duration(math.MaxInt64) to mean "never".
var infinity = time.Duration(math.MaxInt64)

keepAliveOpt := grpc.KeepaliveParams(keepalive.ServerParameters{
	MaxConnectionIdle:     infinity,
	MaxConnectionAge:      infinity,
	MaxConnectionAgeGrace: infinity,
	Time:    25 * time.Second, // ping the client after 25s of inactivity
	Timeout: 5 * time.Second,  // drop the connection if the ping isn't acked within 5s
})

keepAliveEnforcementPolicyOpt := grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
	MinTime:             5 * time.Minute,
	PermitWithoutStream: false,
})
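
For context (not part of the original report), a minimal sketch of how these options would be wired into the server that Envoy proxies to on port 8000; the service registration is hypothetical, and "net" and "log" are also needed:

lis, err := net.Listen("tcp", ":8000")
if err != nil {
	log.Fatalf("listen: %v", err)
}
srv := grpc.NewServer(keepAliveOpt, keepAliveEnforcementPolicyOpt)
// pb.RegisterMyServiceServer(srv, &myService{}) // hypothetical registration
log.Fatal(srv.Serve(lis))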

Envoy config:

Notice that I have a separate route for my streaming endpoint, because I needed to set timeout_ms to 0 on it:

{
  "listeners": [
    {
      "address": "tcp://0.0.0.0:80",
      "filters": [
        {
          "type": "read",
          "name": "http_connection_manager",
          "config": {
            "codec_type": "auto",
            "stat_prefix": "ingress_http",
            "route_config": {
              "virtual_hosts": [
                {
                  "name": "local_service",
                  "domains": ["*"],
                  "routes": [
                    {
                      "timeout_ms": 0,
                      "prefix": "/gprc.prefix.to.my.streaming/Endpoint",
                      "headers": [
                        {"name": "content-type", "value": "application/grpc"}
                      ],
                      "cluster": "local_service_grpc",
                      "retry_policy": {
                        "retry_on": "5xx",
                        "num_retries": 3
                      }
                    },
                    {
                      "timeout_ms": 10000,
                      "prefix": "/",
                      "headers": [
                        {"name": "content-type", "value": "application/grpc"}
                      ],
                      "cluster": "local_service_grpc",
                      "retry_policy": {
                        "retry_on": "5xx",
                        "num_retries": 3
                      }
                    },
                    {
                      "timeout_ms": 10000,
                      "prefix": "/",
                      "cluster": "local_service_http"
                    }
                  ]
                }
              ]
            },
            "filters": [
              {
                "type": "decoder",
                "name": "router",
                "config": {}
              },
              {
                "type": "both",
                "name": "health_check",
                "config": {
                  "pass_through_mode": true,
                  "endpoint": "/healthcheck"
                }
              }
            ]
          }
        }
      ]
    }
  ],
  "admin": {
    "access_log_path": "/dev/null",
    "address": "tcp://0.0.0.0:8001"
  },
  "cluster_manager": {
     "clusters": [
      {
        "name": "local_service_grpc",
        "connect_timeout_ms": 10000,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "features": "http2",
        "hosts": [
          {
            "url": "tcp://127.0.0.1:8000"
          }
        ]
      },
      {
        "name": "local_service_http",
        "connect_timeout_ms": 10000,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [
          {
            "url": "tcp://127.0.0.1:8000"
          }
        ]
      }
    ]
  }
}
@mattklein123 added the question label Nov 21, 2017
@mattklein123
Member

I'm not exactly sure what "keep alive" means in your setup, but if it means proxying PING frames, Envoy does not do that currently.

@mattklein123
Member

FYI we are going to add streaming timeouts (basically timeout if no data frames are received in X seconds in either upstream/downstream direction). This would be a potential fix to ^.
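
For reference, the streaming timeout mentioned here later shipped as the stream idle timeout; in the v2 API it is the stream_idle_timeout field on the HTTP connection manager (it does not exist in the v1 config format used in this issue). A minimal sketch of the relevant fragment, in the same JSON style as the config above:

{
  "name": "envoy.http_connection_manager",
  "config": {
    "stat_prefix": "ingress_http",
    "stream_idle_timeout": "300s"
  }
}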

@htuch
Member

htuch commented Nov 21, 2017

PING frames are only hop-by-hop in HTTP/2 and per-connection, and I don't know if there are any sensible proxying semantics if you are splitting a single client stream across multiple upstream hosts for example. I think Envoy should respond to PING, but not forward.

@mattklein123
Member

PING frames are only hop-by-hop in HTTP/2 and per-connection, and I don't know if there are any sensible proxying semantics if you are splitting a single client stream across multiple upstream hosts for example. I think Envoy should respond to PING, but not forward.

I agree. If asked to actually proxy PING, I was going to say no. :)

@cdelguercio
Author

Sorry, I assumed you would know more about gRPC keep alive, since Envoy normally has such great support for gRPC. In this case it is an option for gRPC streams where you set a time interval (called Time in the gRPC-Go library's KeepaliveParams config struct), and gRPC will then ping the client at that interval to make sure it hasn't disconnected ungracefully. A client that disconnects gracefully sends some sort of disconnect message to the server. A client that disconnects ungracefully sends nothing, so if this keep alive ping isn't turned on, every ungracefully disconnected client leaves a corresponding hanging connection on the server.

It makes sense that Envoy responds to PING frames. Because HTTP/2 PINGs seem so different from ICMP pings, I was hoping the normal ping rules didn't apply. It's unfortunate that I can't use the keep alive feature of gRPC. I guess I will have to build my own application level heartbeat.
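
For illustration, a minimal sketch of such an application-level heartbeat on the server-streaming side (the service, message, and field names are hypothetical, not from this issue). Periodically sending a small message forces a write on the stream, so an ungracefully disconnected client eventually surfaces as a Send error instead of a silently hanging connection:

// Hypothetical server-streaming handler with an application-level heartbeat.
func (s *server) StreamEvents(req *pb.StreamRequest, stream pb.MyService_StreamEventsServer) error {
	ticker := time.NewTicker(25 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stream.Context().Done():
			// The stream was reset or the client disconnected cleanly.
			return stream.Context().Err()
		case <-ticker.C:
			// The write will eventually fail if the client is gone.
			if err := stream.Send(&pb.Event{Heartbeat: true}); err != nil {
				return err
			}
		}
	}
}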

@cdelguercio
Author

Though the problem here is that my service thinks that the request is coming from Envoy, so the PING goes to Envoy, but it is pretty clear that it is intended for the client. Am I getting that wrong?

@mattklein123
Member

@cdelguercio the issue is that the keep-alive mode expects an absence of ping response to mean the other side is gone, and Envoy will always directly respond to ping. In general Envoy would reset the stream, but if there is no FIN and it never tries to write, that won't happen either.

I'm pretty sure that #1778 will fix your issue (which I now see you also originally opened!). If that was in place, you could set data layer timeouts, at which point Envoy would shutdown the stream/connection.

@cdelguercio
Author

Right, and #1778 would fix it as long as the solution includes a heartbeat of some sort, since in my case the connection can (correctly) be open for an indefinite period of time without sending data and still be valid.

@cdelguercio
Author

I saw that the "http: adding 100-Continue support to Envoy" PR (#2497) got merged. Would it make sense to allow Envoy to be configured to proxy PING frames as a non-default option?

@mattklein123
Member

@cdelguercio per the previous discussion, I'm not really sure what it means to proxy ping. Ping is per connection, not per stream. What would the semantics be?

@mpuncel
Contributor

mpuncel commented Jun 6, 2018

Would it be reasonable for Envoy to reset all of the "upstream streams" corresponding to a dead downstream TCP connection, and vice versa?

@mattklein123
Member

@mpuncel Envoy already does this.

@jrajahalme
Contributor

@mattklein123 Using this old closed issue for context: is there a way to use Envoy stream idle timeouts on a route that also handles gRPC streaming connections that may be idle for "a long time" (longer than the stream idle timeout)? What I'm looking for (maybe) is a way to keep a configured stream idle timeout from breaking gRPC streaming connections before the grpc-timeout in effect fires.

@mattklein123
Member

@jrajahalme I don't think so currently. Please open a fresh issue for discussion.

@jrajahalme
Contributor

Opened #5142
