Intermittent 503s sending requests to gRPC service. #504
Comments
After more investigation, I would restate the issue: the first call on a channel/connection intermittently fails. The way the Node client works, it opens a channel for each service. It looks like it's only ever the first call on these channels that fails (though even the first call sometimes succeeds). When the first call fails, subsequent calls on that channel all seem to succeed.
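A client-side mitigation that matches this failure mode (first call on a fresh channel fails, later calls succeed) is to retry once or twice when an RPC comes back as UNAVAILABLE, which is how an HTTP 503 from Envoy surfaces to a gRPC client. The sketch below is only a hedged example for a grpc-go client (the thread mentions both Node and Go clients); the endpoint address and retry counts are placeholder assumptions, and this masks the symptom rather than fixing whatever Ambassador/Envoy is doing.

```go
package main

import (
	"context"
	"crypto/tls"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/status"
)

// retryUnavailable retries a unary RPC a few times when it fails with
// codes.Unavailable, which is how an HTTP 503 from Envoy is surfaced to a
// gRPC client. Since later calls on the same channel tend to succeed, one
// or two retries are usually enough for the failure mode described above.
func retryUnavailable(maxRetries int, backoff time.Duration) grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply interface{},
		cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		for attempt := 0; ; attempt++ {
			err := invoker(ctx, method, req, reply, cc, opts...)
			if err == nil || status.Code(err) != codes.Unavailable || attempt >= maxRetries {
				return err
			}
			time.Sleep(time.Duration(attempt+1) * backoff)
		}
	}
}

func main() {
	// "ambassador.example.com:443" is a placeholder for the NLB/Ambassador endpoint.
	conn, err := grpc.Dial("ambassador.example.com:443",
		grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{})),
		grpc.WithUnaryInterceptor(retryUnavailable(2, 100*time.Millisecond)),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// Generated service stubs built on conn will now retry transient 503s.
}
```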
Example logs from a failed call from our Go client:
We updated to Ambassador 0.34.2 (using Envoy 1.6.0) and it looks like this has gone away. We need to do more extended testing to confirm.
This ended up being a conflict between Envoy and kube-router. Closing this ticket.
@thedevelopnik we are encountering a similar problem with our kube cluster as well.
+1, we unexpectedly had this problem recently as well, with no logs or explanation. Restarting the Ambassador pods fixed our situation, but we're worried it will unexpectedly happen again. The logs were not useful or helpful; Ambassador just stopped routing and started 503-ing. Ours was not intermittent, though: it was a hard drop-off and all requests 503'd.
+1, we began having this problem very recently, without any clear indication of what's causing the failures. We have a fairly low request volume passing through Ambassador during testing (<100 req/min). Redeploying the pods resolved the issue for us. Same exact behavior as described by @AndrewFarley above. We're using Ambassador v0.40 and Kubernetes v1.10.7.
+1, we are facing the same issue with version 0.40.2. I am able to reproduce it when there are no requests to the service for a couple of hours. To work around it we are planning to add a retry mechanism for 503 errors, but that doesn't seem like a real solution to the problem. Any help from the Ambassador team would be much appreciated.
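If the client happens to be grpc-go, one way to get the retry mechanism mentioned above without hand-rolling it in application code is a gRPC service config with a retryPolicy for UNAVAILABLE (the status a 503 maps to). This is only a sketch under that assumption: the service name and endpoint are placeholders, and on some older grpc-go releases this retry support was gated behind the GRPC_GO_RETRY=on environment variable.

```go
package main

import (
	"log"

	"google.golang.org/grpc"
)

// serviceConfig asks grpc-go to retry RPCs that fail with UNAVAILABLE
// (the gRPC status an HTTP 503 is mapped to) using exponential backoff.
// "mypkg.MyService" is a placeholder for the real fully qualified service name.
const serviceConfig = `{
  "methodConfig": [{
    "name": [{"service": "mypkg.MyService"}],
    "retryPolicy": {
      "maxAttempts": 4,
      "initialBackoff": "0.2s",
      "maxBackoff": "2s",
      "backoffMultiplier": 2.0,
      "retryableStatusCodes": ["UNAVAILABLE"]
    }
  }]
}`

func main() {
	// "ambassador.example.com:80" is a placeholder endpoint; use TLS
	// credentials instead of WithInsecure() when talking to the real gateway.
	conn, err := grpc.Dial("ambassador.example.com:80",
		grpc.WithInsecure(),
		grpc.WithDefaultServiceConfig(serviceConfig),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```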
If anyone a) has a reproducer and/or b) can try testing and reproducing on 0.50-EA7, that would be really helpful! Thanks.
Same here - could this be related to envoyproxy/envoy#3771 (comment)?
^^ Unsure if that is the same issue; it could be. The last time this occurred for us, it was because we were using AuthService and one of the Ambassador pods was failing auth for no good reason, yet its liveness/readiness probes were still passing despite auth failing. We shelled into that pod and were able to reach the auth service just fine; it was just Ambassador being silly, but this is what probes are supposed to be for, eh, instead of just 503-ing everything. The other time this occurred for us was when a bad annotation was pushed by our developers in our dev environment. Ambassador usually keeps working just fine for a few hours with a bad config in place, but we had no monitoring at the time to detect the bad annotation, and eventually Ambassador just collapsed and started 503-ing. This might be related to the Envoy issue above. Restarting with new pods didn't help in this case because, of course, a fresh pod just saw the bad config and didn't know what the last good config was. We've since added some monitoring to detect bad annotations.
I'm having a problem with intermittent 503s ("upstream connect error or disconnect/reset before headers"), especially with a cold service start. Setting timeout_ms in my service route annotation doesn't help (e.g. I set it to 30000, but I still get a 503 within 3 seconds). The service pod runs on Windows nodes. Another, less random, pattern happens when I deploy a Linux pod to the kube-system namespace: 3 times out of 4 the pod never sees the incoming requests (which result in 503s), and if it fails the first time, it never works. Restarting both the service pod and the Ambassador pod a few times eventually convinces the route to work, and then it works like a charm until I need to deploy a new container version, which leads back to 503s. Ambassador v0.40.2 + acs-engine + k8s v1.10.0 + Azure.
Using Ambassador 0.52.0 and seeing these intermittent errors, especially on first startup. Not every time, but often enough for it to be very noticeable in our suite of end-to-end tests. We've had to build retries into the tests that allow for an especially long delay just for gRPC calls - SeldonIO/seldon-core#492 (comment). In that reference there's a link to a failure trace where it failed even with
We are having to downgrade back to what we were using (0.40.2), as we weren't seeing this issue with this kind of seriousness on that version.
I've tried more versions: 0.50.0 seems fine, whereas 0.51.0 has the gRPC startup issue I just mentioned, with the 503s happening repeatedly for the first minute or so. During that time the Ambassador logs show lots of:
I don't see that logging at all on 0.50.0. I am using the stable/ambassador helm chart with the default of container port 8080 mapped to 80. Given that the error relates to port access, perhaps it will work in 0.60.0 when the default port changes: 7ac8f4a#diff-4ac32a78649ca5bdd8e0ba38b7006a1eR5
That is probably unrelated in the sense that you should just fix your permissions (or update Ambassador not to run on port 80).
Yeah, it's ok if I run as root, so we will do that for now rather than downgrade. But we're using the helm chart defaults (i.e. 8080) and not setting anything to use port 80, so we shouldn't really have to run as root. Our code and configuration are open source if you want to double-check.
@ryandawsonuk I have taken a look at your Ambassador values.yaml configuration and am having a hard time understanding how you are seeing the permission error you describe. Could you post a reproducer so I can dig a little further into what is causing this issue?
The best test I can offer right now is to clone https://github.com/SeldonIO/seldon-core and modify the helm chart to remove the root permission. Then go to testing/scripts and, with a kube context set to an empty minikube or cluster, run
Given that the default port is changing in v0.60.0, I'll be happy to test on that version when it's available. Also happy to chat on Slack.
After more testing, it turns out the 503 issue is still there even as root. The warning we were seeing as non-root was not part of the cause (at least not directly); SeldonIO/seldon-core#473 (comment) explains. But it seems to be better on v0.50.0, so we're having to downgrade for now and will retest on v0.60.0 when that is available.
Hi, I am still seeing this issue.
I found this issue to be related to IPVS. https://success.docker.com/article/ipvs-connection-timeout-issue
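The linked article describes IPVS expiring idle TCP connections after roughly 900 seconds, which lines up with the reports above of the first request after a quiet period failing. A hedged, Go-specific mitigation is to enable gRPC client keepalives well inside that window so the long-lived HTTP/2 connection never looks idle; the numbers and endpoint below are illustrative assumptions, and the server side may need to be configured to tolerate pings when no RPCs are active.

```go
package main

import (
	"crypto/tls"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Keep the HTTP/2 connection "warm" so an IPVS idle timeout (~900s per the
	// linked article) never silently drops it between bursts of traffic.
	kacp := keepalive.ClientParameters{
		Time:                5 * time.Minute,  // send a ping if the connection has been idle this long
		Timeout:             20 * time.Second, // wait this long for a ping ack before closing
		PermitWithoutStream: true,             // ping even when there are no active RPCs
	}

	// "ambassador.example.com:443" is a placeholder for the real endpoint.
	conn, err := grpc.Dial("ambassador.example.com:443",
		grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{})),
		grpc.WithKeepaliveParams(kacp),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```

The article itself also suggests tuning OS-level TCP keepalives (e.g. lowering net.ipv4.tcp_keepalive_time below the IPVS idle timeout), which addresses the same idle-connection drop from the node side.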
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Reopening (as per #1587 & SeldonIO/seldon-core#473)
Describe the bug
We get intermittent 503s when sending requests through Ambassador to a gRPC service. Often they come as the first few requests after the service has received no traffic for a while, but they still occur intermittently.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Ambassador responds as available for all requests to an established service.
Versions (please complete the following information):
Additional context
There is an AWS NLB in front of Ambassador, and we have TLS configured. TLS is terminated both in Ambassador and in the application, using the same certificate. Hitting the service through an AWS NLB without Ambassador does not have this problem. I may be able to provide the test client code we are using so you can hit our service, but since it is internal proprietary code it's not something I can just point to here. I can't guarantee that the reproduction steps above will reproduce the issue.
Looking through the Ambassador code, it looks like the only places it's set to send 503s are its health check and the auth service, and we're not using the auth service, so this might be an underlying Envoy issue?