Intermittent 503s sending requests to gRPC service. #504

Closed
thedevelopnik opened this issue Jun 8, 2018 · 23 comments
Labels
stale Issue is stale and will be closed

Comments

@thedevelopnik

thedevelopnik commented Jun 8, 2018

Describe the bug
We get intermittent 503s when sending requests through Ambassador to a gRPC service. They often show up on the first few requests after the service has been idle for a while, but they also keep occurring intermittently after that.

To Reproduce
Steps to reproduce the behavior:

  1. Make a gRPC server and put it behind Ambassador.
  2. Send multiple requests in a row to the service from a client; the traffic does not have to be concurrent or high volume (a rough client sketch follows this list).
  3. Observe intermittent 503 responses.
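
Purely for illustration, a minimal Go client along those lines; the endpoint name, the bare TLS config, and the use of the standard gRPC health-check service in place of the real application stub are all assumptions, not the reporter's actual code:

    package main

    import (
        "context"
        "crypto/tls"
        "log"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials"
        "google.golang.org/grpc/health/grpc_health_v1"
        "google.golang.org/grpc/status"
    )

    func main() {
        // Placeholder endpoint for the NLB / Ambassador service.
        conn, err := grpc.Dial("ambassador.example.com:443",
            grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{})))
        if err != nil {
            log.Fatalf("dial: %v", err)
        }
        defer conn.Close()

        // The standard health-check service stands in for the real application stub.
        client := grpc_health_v1.NewHealthClient(conn)
        for i := 0; i < 20; i++ {
            ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
            _, err := client.Check(ctx, &grpc_health_v1.HealthCheckRequest{})
            cancel()
            if err != nil {
                // An Envoy 503 shows up here as a non-OK gRPC status (e.g. Unavailable/Internal).
                log.Printf("call %d failed: code=%v err=%v", i, status.Code(err), err)
            } else {
                log.Printf("call %d ok", i)
            }
            time.Sleep(time.Second)
        }
    }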

Expected behavior
Ambassador routes all requests to an established service without spurious 503s.

Versions (please complete the following information):

  • Ambassador: [0.34.0, 0.34.1]
  • Kubernetes environment [e.g. Kops managed cluster running in AWS, master is t2.medium, nodes are t2.large/t2.xlarge]
  • Version [e.g. 1.9.8]

Additional context
There is an AWS NLB in front of Ambassador, and we have TLS configured. TLS is terminated both in Ambassador and in the application, using the same certificate. Hitting the service through an AWS NLB without Ambassador does not have this problem. I may be able to provide the test client code we are using so you can hit our service, but since it is internal proprietary code it's not something I can just point to here. I can't guarantee that the reproduction steps above will reliably reproduce the issue.

Looking through the Ambassador code, the only places it appears to send 503s itself are the health check and the auth service. We're not using the auth service, so this might be an underlying Envoy issue?

@thedevelopnik
Author

After more investigation, I would restate the issue as: the first call on a channel/connection intermittently fails. The way the Node client works, it opens a channel for each service. It looks like it's only ever the first call on these channels that fails (though even the first call sometimes succeeds). When the first call fails, subsequent calls on that channel all seem to succeed.
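
Since gRPC channels connect lazily, the first RPC on a fresh channel is also the call that pays for connection establishment. A small sketch (reusing the imports and placeholder endpoint from the client sketch above) of forcing the channel to READY before the first call; this would not help if the reset happens between Envoy and the upstream, but it can help tell the two cases apart:

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    // grpc.WithBlock makes DialContext wait for the connection to be established
    // instead of connecting lazily on the first RPC (endpoint is a placeholder).
    conn, err := grpc.DialContext(ctx, "ambassador.example.com:443",
        grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{})),
        grpc.WithBlock())
    if err != nil {
        log.Fatalf("channel never became ready: %v", err)
    }
    defer conn.Close()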

@thedevelopnik
Author

thedevelopnik commented Jun 8, 2018

Example logs from a failed call from our Go client:

2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: wrote SETTINGS len=0
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: wrote HEADERS flags=END_HEADERS stream=1 len=211
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: wrote DATA flags=END_STREAM stream=1 len=18 data="\x00\x00\x00\x00\r\n\v\x12\tsomething"
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: read SETTINGS len=6, settings: INITIAL_WINDOW_SIZE=268435456
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: read WINDOW_UPDATE len=4 (conn) incr=268369921
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: wrote SETTINGS flags=ACK len=0
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: read SETTINGS flags=ACK len=0
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: read HEADERS flags=END_HEADERS stream=1 len=49
2018/06/08 10:08:33 http2: decoded hpack field header field ":status" = "503"
2018/06/08 10:08:33 http2: decoded hpack field header field "content-length" = "57"
2018/06/08 10:08:33 http2: decoded hpack field header field "content-type" = "text/plain"
2018/06/08 10:08:33 http2: decoded hpack field header field "date" = "Fri, 08 Jun 2018 16:08:32 GMT"
2018/06/08 10:08:33 http2: decoded hpack field header field "server" = "envoy"
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: read DATA flags=END_STREAM stream=1 len=57 data="upstream connect error or disconnect/reset before headers"
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: wrote RST_STREAM stream=1 len=4 ErrCode=PROTOCOL_ERROR
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: wrote WINDOW_UPDATE len=4 (conn) incr=57
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: wrote PING len=8 ping="\x02\x04\x10\x10\t\x0e\a\a"
panic: rpc error: code = Internal desc = transport: received the unexpected content-type "text/plain"

@thedevelopnik
Author

We updated to Ambassador 0.34.2, which uses Envoy 1.6.0, and it looks like this has gone away. We need to do more extended testing to confirm.

@thedevelopnik
Author

This ended up being a conflict between Envoy and kube-router. Closing this ticket.

@badoet

badoet commented Jul 30, 2018

@thedevelopnik we are encountering a similar problem in our kube cluster.
Care to share and elaborate on the "conflict between Envoy and kube-router"?
Thank you 🙇

@AndrewFarley

+1 we unexpectedly had this problem recently as well, with no logs or explanations. Restarting the Ambassador pods fixed our situation, but we're worried it will unexpectedly happen again. The logs were not useful or helpful; Ambassador simply stopped routing and started 503-ing. In our case it was not intermittent, though: it was a sudden drop-off, and all requests 503'd.

@mianelson

mianelson commented Oct 17, 2018

+1 we began having this problem very recently, without any clear indication of what's causing the failures. We have a fairly low request volume passing through Ambassador during testing (<100 req/min). Redeploying the pods resolved the issue for us. Exact same behavior as described by @AndrewFarley above. We're using Ambassador v0.40 and Kubernetes v1.10.7.

@Harishkrishna17

+1, we are facing the same issue with version 0.40.2. I am able to reproduce it when the service receives no requests for a couple of hours. To work around it, we are planning to add a retry mechanism for 503 errors, but that doesn't seem like a real solution to the problem.

Any help from the Ambassador team would be much appreciated.
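
For reference, a rough sketch of that kind of client-side retry in Go; the helper name, the set of retryable codes, and the backoff are assumptions rather than anything Ambassador provides:

    // Hypothetical helper: retry a unary call a few times when the failure looks
    // like a transport-level 503 from the proxy. Envoy's "upstream connect error
    // or disconnect/reset before headers" typically surfaces to a Go gRPC client
    // as codes.Unavailable or codes.Internal (as in the logs earlier in this thread).
    package retry

    import (
        "context"
        "time"

        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    func CallWithRetry(ctx context.Context, attempts int, call func(context.Context) error) error {
        var err error
        for i := 0; i < attempts; i++ {
            if err = call(ctx); err == nil {
                return nil
            }
            code := status.Code(err)
            if code != codes.Unavailable && code != codes.Internal {
                return err // don't retry application-level errors
            }
            time.Sleep(time.Duration(i+1) * 200 * time.Millisecond) // simple linear backoff
        }
        return err
    }

A call site would then wrap each RPC, for example: err := retry.CallWithRetry(ctx, 3, func(ctx context.Context) error { _, e := client.Check(ctx, &grpc_health_v1.HealthCheckRequest{}); return e })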

@richarddli
Contributor

If anyone a) has a reproducer and/or b) can try testing and reproducing on 0.50-EA7, that would be really helpful! Thanks.

@MathiasGr

Same here - could this be related to envoyproxy/envoy#3771 (comment) ?

@AndrewFarley

AndrewFarley commented Dec 7, 2018

^^ Unsure if that is the same issue; it could be.

The last time this occurred for us, it was because we were using AuthService and one of the Ambassador pods was failing auth for no good reason, while its liveness/readiness probes kept passing despite auth failing. We shelled into the pod and were able to reach the auth service just fine; it was just Ambassador being silly. But this is what probes are supposed to be for, eh, instead of just 503-ing everything.

The other time this occurred for us was when a bad annotation was pushed by our developers in our dev environment. Ambassador usually keeps working just fine for a few hours with a bad config in place, but we had no monitoring at the time to detect the bad annotation, and eventually Ambassador just collapsed and started 503-ing. This might be related to the Envoy issue above. Restarting the pods didn't help in this case, because of course a fresh pod then saw only the bad config and didn't know what the last "good" config was. We've since added some monitoring to detect bad annotations.

@abaymar

abaymar commented Dec 14, 2018

I'm having a problem with intermittent 503s ("upstream connect error or disconnect/reset before headers"), especially on a cold service start. Setting timeout_ms in my service route annotation doesn't help (e.g. I set it to 30000, but I still get the 503 within 3 seconds). The service pod runs on Windows nodes.

Another, less random, pattern happens when I deploy a Linux pod to the kube-system namespace. Three times out of four the pod never sees the incoming requests (which result in 503s), and if it fails the first time, it never recovers. Restarting both the service pod and the Ambassador pod a few times eventually convinces the route to work, and then it works like a charm until I need to deploy a new container version, which leads back to 503s.

Ambassador v0.40.2 + acs-engine + k8s v1.10.0 + Azure.

@ryandawsonuk

ryandawsonuk commented Apr 4, 2019

Using Ambassador 0.52.0 and seeing these intermittent errors, especially on first startup. Not every time, but often enough to be very noticeable in our suite of end-to-end tests. We've had to build retries into the tests, with an especially long delay allowed just for gRPC calls - SeldonIO/seldon-core#492 (comment). That reference links to a failure trace where a call failed even with delay_since_first_attempt_ms = 44073. In those tests the gRPC calls aren't even started until after REST calls have been made successfully. We never see this issue with REST.

We are having to downgrade back to what we were using (0.40.2), as we weren't seeing the issue with anything like this severity on that version.

@ryandawsonuk

ryandawsonuk commented Apr 5, 2019

I've tried more versions: 0.50.0 seems fine, whereas 0.51.0 has the gRPC startup issue I just mentioned, with the 503s happening repeatedly for the first minute or so. During that time the Ambassador logs show lots of:

[2019-04-05 14:01:41.078][000083][warning][config] [bazel-out/k8-dbg/bin/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:70] gRPC config for type.googleapis.com/envoy.api.v2.Listener rejected: Error adding/updating listener ambassador-listener-80: cannot bind '0.0.0.0:80': Permission denied

I don't see that logging at all on 0.50.0. I am using the stable/ambassador helm chart with the default container port of 8080 mapped to 80.

Given that the error relates to port access, perhaps it will work in 0.60.0 when the default port changes: 7ac8f4a#diff-4ac32a78649ca5bdd8e0ba38b7006a1eR5

@richarddli
Contributor

That is probably unrelated in the sense that you should just fix your permissions (or update Ambassador not to run on port 80).

@ryandawsonuk

Yeah, it's OK if I run as root, so we will do that for now rather than downgrade. But we're using the helm chart defaults (i.e. 8080) and not setting anything to use port 80, so we shouldn't really have to run as root. Our code and configuration are open source if you want to double-check.

@iNoahNothing
Contributor

@ryandawsonuk I have taken a look at your Ambassador values.yaml configuration and am having a hard time understanding how you are seeing the cannot bind '0.0.0.0:80' error. We've seen this happen when the helm chart runs as non-root by default, which requires the service_port: 8080 annotation in the ambassador Module so that Ambassador listens on 8080 instead of 80.

Could you post a reproducer so I can dig a little further into what is causing this issue?

@ryandawsonuk

ryandawsonuk commented Apr 8, 2019

The best test I can offer right now is to clone https://github.com/SeldonIO/seldon-core and modify the helm chart to remove the root permission. Then go to testing/scripts and, with your kube context set to an empty minikube or cluster, run pytest -s -W ignore test_helm_charts_single_namespace.py

Given that the default port is changing in v0.60.0, I'll be happy to test on that version when it's available. Also happy to chat on Slack.

@ryandawsonuk

ryandawsonuk commented Apr 10, 2019

After more testing, it turns out the 503 issue is still there even when running as root. The warning we were seeing as non-root was not part of the cause (at least not directly); SeldonIO/seldon-core#473 (comment) explains. It does seem to be better in v0.50.0, so we're downgrading for now and will retest on v0.60.0 when that is available.

@vaibhavrtk

Hi, I am still seeing this issue.
It seems to me this is an Envoy problem: envoyproxy/envoy#6190.
They seem to have fixed it already. When will that fix be available in Ambassador?

@vaibhavrtk

I found this issue to be related to IPVS: https://success.docker.com/article/ipvs-connection-timeout-issue
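
If that is the cause (the linked article describes IPVS silently expiring idle TCP connections after 900 seconds), one common client-side mitigation is gRPC keepalive pings so the connection never sits idle that long. A sketch in Go; the interval is an assumption, and the proxy/server has to tolerate the pings:

    package client

    import (
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"
    )

    // Keep the connection from idling past the IPVS timeout by sending HTTP/2
    // keepalive pings. The values below are illustrative only.
    var keepaliveOption = grpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time:                5 * time.Minute,  // ping after 5 minutes of inactivity (well under 900s)
        Timeout:             20 * time.Second, // drop the connection if the ping is not acknowledged
        PermitWithoutStream: true,             // ping even when no RPCs are in flight
    })

The option would be passed to grpc.Dial / grpc.DialContext alongside the transport credentials.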

@stale

stale bot commented Jul 2, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Issue is stale and will be closed label Jul 2, 2019
@stale stale bot closed this as completed Jul 9, 2019
@axsaucedo

axsaucedo commented Aug 14, 2019

Reopening (as per #1587 & SeldonIO/seldon-core#473)
