Intermittent 503s sending requests to gRPC service. #504

Closed
thedevelopnik opened this issue Jun 8, 2018 · 23 comments
Labels
stale Issue is stale and will be closed

Comments

@thedevelopnik

thedevelopnik commented Jun 8, 2018

Describe the bug
We get intermittent 503s when sending requests through Ambassador to a gRPC service. They often show up on the first few requests after the service has been idle for a while, but they also keep occurring intermittently after that.

To Reproduce
Steps to reproduce the behavior:

  1. Make a gRPC server and put it behind Ambassador.
  2. Send multiple requests in a row to the service from a client; the traffic does not have to be concurrent or high volume (a rough client sketch follows this list).
  3. Observe intermittent 503 responses.
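
Purely for illustration, a minimal Go client along those lines; the endpoint name, the bare TLS config, and the use of the standard gRPC health-check service in place of the real application stub are all assumptions, not the reporter's actual code:

    package main

    import (
        "context"
        "crypto/tls"
        "log"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials"
        "google.golang.org/grpc/health/grpc_health_v1"
        "google.golang.org/grpc/status"
    )

    func main() {
        // Placeholder endpoint for the NLB / Ambassador service.
        conn, err := grpc.Dial("ambassador.example.com:443",
            grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{})))
        if err != nil {
            log.Fatalf("dial: %v", err)
        }
        defer conn.Close()

        // The standard health-check service stands in for the real application stub.
        client := grpc_health_v1.NewHealthClient(conn)
        for i := 0; i < 20; i++ {
            ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
            _, err := client.Check(ctx, &grpc_health_v1.HealthCheckRequest{})
            cancel()
            if err != nil {
                // An Envoy 503 shows up here as a non-OK gRPC status (e.g. Unavailable/Internal).
                log.Printf("call %d failed: code=%v err=%v", i, status.Code(err), err)
            } else {
                log.Printf("call %d ok", i)
            }
            time.Sleep(time.Second)
        }
    }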

Expected behavior
Ambassador routes all requests to an established service without spurious 503s.

Versions (please complete the following information):

  • Ambassador: [0.34.0, 0.34.1]
  • Kubernetes environment [e.g. Kops managed cluster running in AWS, master is t2.medium, nodes are t2.large/t2.xlarge]
  • Version [e.g. 1.9.8]

Additional context
There is an AWS NLB in front of Ambassador, and we have TLS configured. TLS is terminated both in Ambassador and in the application, using the same certificate. Hitting the service through an AWS NLB without Ambassador does not have this problem. I may be able to provide the test client code we are using so you can hit our service, but since it is internal proprietary code it's not something I can just point to here. I can't guarantee that the reproduction steps above will reliably reproduce the issue.

Looking through the Ambassador code, the only places it appears to send 503s itself are the health check and the auth service. We're not using the auth service, so this might be an underlying Envoy issue?

@thedevelopnik
Author

After more investigation, I would restate the issue as: the first call on a channel/connection intermittently fails. The way the Node client works, it opens a channel for each service. It looks like it's only ever the first call on these channels that fails (though even the first call sometimes succeeds). When the first call fails, subsequent calls on that channel all seem to succeed.
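
Since gRPC channels connect lazily, the first RPC on a fresh channel is also the call that pays for connection establishment. A small sketch (reusing the imports and placeholder endpoint from the client sketch above) of forcing the channel to READY before the first call; this would not help if the reset happens between Envoy and the upstream, but it can help tell the two cases apart:

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    // grpc.WithBlock makes DialContext wait for the connection to be established
    // instead of connecting lazily on the first RPC (endpoint is a placeholder).
    conn, err := grpc.DialContext(ctx, "ambassador.example.com:443",
        grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{})),
        grpc.WithBlock())
    if err != nil {
        log.Fatalf("channel never became ready: %v", err)
    }
    defer conn.Close()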

@thedevelopnik
Author

thedevelopnik commented Jun 8, 2018

Example logs from a failed call from our Go client:

2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: wrote SETTINGS len=0
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: wrote HEADERS flags=END_HEADERS stream=1 len=211
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: wrote DATA flags=END_STREAM stream=1 len=18 data="\x00\x00\x00\x00\r\n\v\x12\tsomething"
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: read SETTINGS len=6, settings: INITIAL_WINDOW_SIZE=268435456
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: read WINDOW_UPDATE len=4 (conn) incr=268369921
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: wrote SETTINGS flags=ACK len=0
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: read SETTINGS flags=ACK len=0
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: read HEADERS flags=END_HEADERS stream=1 len=49
2018/06/08 10:08:33 http2: decoded hpack field header field ":status" = "503"
2018/06/08 10:08:33 http2: decoded hpack field header field "content-length" = "57"
2018/06/08 10:08:33 http2: decoded hpack field header field "content-type" = "text/plain"
2018/06/08 10:08:33 http2: decoded hpack field header field "date" = "Fri, 08 Jun 2018 16:08:32 GMT"
2018/06/08 10:08:33 http2: decoded hpack field header field "server" = "envoy"
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: read DATA flags=END_STREAM stream=1 len=57 data="upstream connect error or disconnect/reset before headers"
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: wrote RST_STREAM stream=1 len=4 ErrCode=PROTOCOL_ERROR
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: wrote WINDOW_UPDATE len=4 (conn) incr=57
2018/06/08 10:08:33 http2: Framer 0xc4201dc0e0: wrote PING len=8 ping="\x02\x04\x10\x10\t\x0e\a\a"
panic: rpc error: code = Internal desc = transport: received the unexpected content-type "text/plain"

@thedevelopnik
Author

We updated to Ambassador 0.34.2, which uses Envoy 1.6.0, and it looks like this has gone away. We need to do more extended testing to confirm.

@thedevelopnik
Author

This ended up being a conflict between Envoy and kube-router. Closing this ticket.

@badoet

badoet commented Jul 30, 2018

@thedevelopnik we are encountering a similar problem in our kube cluster.
Care to share and elaborate on the "conflict between Envoy and kube-router"?
Thank you 🙇

@AndrewFarley

+1 we unexpectedly had this problem recently as well, with no logs or explanations. Restarting the Ambassador pods fixed our situation, but we're worried it will unexpectedly happen again. The logs were not useful or helpful; Ambassador simply stopped routing and started 503-ing. In our case it was not intermittent, though: it was a sudden drop-off, and all requests 503'd.

@mianelson

mianelson commented Oct 17, 2018

+1 we began having this problem very recently, without any clear indication of what's causing the failures. We have a fairly low request volume passing through Ambassador during testing (<100 req/min). Redeploying the pods resolved the issue for us. Exact same behavior as described by @AndrewFarley above. We're using Ambassador v0.40 and Kubernetes v1.10.7.

@Harishkrishna17

+1, we are facing the same issue with version 0.40.2. I am able to reproduce it when the service receives no requests for a couple of hours. To work around it, we are planning to add a retry mechanism for 503 errors, but that doesn't seem like a real solution to the problem.

Any help from the Ambassador team would be much appreciated.
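
For reference, a rough sketch of that kind of client-side retry in Go; the helper name, the set of retryable codes, and the backoff are assumptions rather than anything Ambassador provides:

    // Hypothetical helper: retry a unary call a few times when the failure looks
    // like a transport-level 503 from the proxy. Envoy's "upstream connect error
    // or disconnect/reset before headers" typically surfaces to a Go gRPC client
    // as codes.Unavailable or codes.Internal (as in the logs earlier in this thread).
    package retry

    import (
        "context"
        "time"

        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    func CallWithRetry(ctx context.Context, attempts int, call func(context.Context) error) error {
        var err error
        for i := 0; i < attempts; i++ {
            if err = call(ctx); err == nil {
                return nil
            }
            code := status.Code(err)
            if code != codes.Unavailable && code != codes.Internal {
                return err // don't retry application-level errors
            }
            time.Sleep(time.Duration(i+1) * 200 * time.Millisecond) // simple linear backoff
        }
        return err
    }

A call site would then wrap each RPC, for example: err := retry.CallWithRetry(ctx, 3, func(ctx context.Context) error { _, e := client.Check(ctx, &grpc_health_v1.HealthCheckRequest{}); return e })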

@richarddli
Contributor

If anyone a) has a reproducer and/or b) can try testing and reproducing on 0.50-EA7, that would be really helpful! Thanks.

@MathiasGr

Same here - could this be related to envoyproxy/envoy#3771 (comment) ?

@AndrewFarley

AndrewFarley commented Dec 7, 2018

^^ Unsure if that is the same issue; it could be.

The last time this occurred for us, it was because we were using AuthService and one of the Ambassador pods was failing auth for no good reason, while its liveness/readiness probes kept passing despite auth failing. We shelled into the pod and were able to reach the auth service just fine; it was just Ambassador being silly. But this is what probes are supposed to be for, eh, instead of just 503-ing everything.

The other time this occurred for us was when a bad annotation was pushed by our developers in our dev environment. Ambassador usually keeps working just fine for a few hours with a bad config in place, but we had no monitoring at the time to detect the bad annotation, and eventually Ambassador just collapsed and started 503-ing. This might be related to the Envoy issue above. Restarting the pods didn't help in this case, because of course a fresh pod then saw only the bad config and didn't know what the last "good" config was. We've since added some monitoring to detect bad annotations.

@abaymar

abaymar commented Dec 14, 2018

I'm having a problem with intermittent 503s ("upstream connect error or disconnect/reset before headers"), especially on a cold service start. Setting timeout_ms in my service route annotation doesn't help (e.g. I set it to 30000, but I still get the 503 within 3 seconds). The service pod runs on Windows nodes.

Another, less random, pattern happens when I deploy a Linux pod to the kube-system namespace. Three times out of four the pod never sees the incoming requests (which result in 503s), and if it fails the first time, it never recovers. Restarting both the service pod and the Ambassador pod a few times eventually convinces the route to work, and then it works like a charm until I need to deploy a new container version, which leads back to 503s.

Ambassador v0.40.2 + acs-engine + k8s v1.10.0 + Azure.

@ryandawsonuk

ryandawsonuk commented Apr 4, 2019

Using Ambassador 0.52.0 and seeing these intermittent errors, especially on first startup. Not every time, but often enough to be very noticeable in our suite of end-to-end tests. We've had to build retries into the tests, with an especially long delay allowed just for gRPC calls - SeldonIO/seldon-core#492 (comment). That reference links to a failure trace where a call failed even with delay_since_first_attempt_ms = 44073. In those tests the gRPC calls aren't even started until after REST calls have been made successfully. We never see this issue with REST.

We are having to downgrade back to what we were using (0.40.2), as we weren't seeing the issue with anything like this severity on that version.

@ryandawsonuk

ryandawsonuk commented Apr 5, 2019

I've tried more versions: 0.50.0 seems fine, whereas 0.51.0 has the gRPC startup issue I just mentioned, with the 503s happening repeatedly for the first minute or so. During that time the Ambassador logs show lots of:

[2019-04-05 14:01:41.078][000083][warning][config] [bazel-out/k8-dbg/bin/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:70] gRPC config for type.googleapis.com/envoy.api.v2.Listener rejected: Error adding/updating listener ambassador-listener-80: cannot bind '0.0.0.0:80': Permission denied

I don't see that logging at all on 0.50.0. I am using the stable/ambassador helm chart with the default container port of 8080 mapped to 80.

Given that the error relates to port access, perhaps it will work in 0.60.0 when the default port changes: 7ac8f4a#diff-4ac32a78649ca5bdd8e0ba38b7006a1eR5

@richarddli
Contributor

That is probably unrelated in the sense that you should just fix your permissions (or update Ambassador not to run on port 80).

@ryandawsonuk

Yeah, it's OK if I run as root, so we will do that for now rather than downgrade. But we're using the helm chart defaults (i.e. 8080) and not setting anything to use port 80, so we shouldn't really have to run as root. Our code and configuration are open source if you want to double-check.

@iNoahNothing
Contributor

@ryandawsonuk I have taken a look at your Ambassador values.yaml configuration and am having a hard time understanding how you are seeing the cannot bind '0.0.0.0:80' error. We've seen this happen when the helm chart runs as non-root by default, which requires the service_port: 8080 annotation in the ambassador Module so that Ambassador listens on 8080 instead of 80.

Could you post a reproducer so I can dig a little further into what is causing this issue?

@ryandawsonuk

ryandawsonuk commented Apr 8, 2019

The best test I can offer right now is to clone https://github.com/SeldonIO/seldon-core and modify the helm chart to remove the root permission. Then go to testing/scripts and, with your kube context set to an empty minikube or cluster, run pytest -s -W ignore test_helm_charts_single_namespace.py

Given that the default port is changing in v0.60.0, I'll be happy to test on that version when it's available. Also happy to chat on Slack.

@ryandawsonuk

ryandawsonuk commented Apr 10, 2019

After more testing, it turns out the 503 issue is still there even when running as root. The warning we were seeing as non-root was not part of the cause (at least not directly); SeldonIO/seldon-core#473 (comment) explains. It does seem to be better in v0.50.0, so we're downgrading for now and will retest on v0.60.0 when that is available.

@vaibhavrtk

Hi, I am still seeing this issue.
It seems to me this is an Envoy problem: envoyproxy/envoy#6190.
They seem to have fixed it already. When will that fix be available in Ambassador?

@vaibhavrtk

I found this issue to be related to IPVS: https://success.docker.com/article/ipvs-connection-timeout-issue
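
If that is the cause (the linked article describes IPVS silently expiring idle TCP connections after 900 seconds), one common client-side mitigation is gRPC keepalive pings so the connection never sits idle that long. A sketch in Go; the interval is an assumption, and the proxy/server has to tolerate the pings:

    package client

    import (
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"
    )

    // Keep the connection from idling past the IPVS timeout by sending HTTP/2
    // keepalive pings. The values below are illustrative only.
    var keepaliveOption = grpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time:                5 * time.Minute,  // ping after 5 minutes of inactivity (well under 900s)
        Timeout:             20 * time.Second, // drop the connection if the ping is not acknowledged
        PermitWithoutStream: true,             // ping even when no RPCs are in flight
    })

The option would be passed to grpc.Dial / grpc.DialContext alongside the transport credentials.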

@stale

stale bot commented Jul 2, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Issue is stale and will be closed label Jul 2, 2019
@stale stale bot closed this as completed Jul 9, 2019
@axsaucedo

axsaucedo commented Aug 14, 2019

Reopening (as per #1587 & SeldonIO/seldon-core#473)
