
Debugging 503s #3771

Closed
dnivra26 opened this issue Jul 2, 2018 · 17 comments
Labels
question (Questions that are neither investigations, bugs, nor enhancements)
stale (stalebot believes this issue/PR has not been touched recently)

Comments

@dnivra26

dnivra26 commented Jul 2, 2018

Title: Debugging 503s

Description:
We have a service which is returning 503s for about 1% of the total traffic.
The target service has 100 replicas and the calling service has 50.
We have updated the circuit breaker configuration as below:

circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 10000
    max_pending_requests: 10000
    max_requests: 10000
    max_retries: 3

The traffic would be around 10,000 requests/second and the latency is around 200ms, which works out to roughly 10,000 × 0.2 ≈ 2,000 concurrent requests in flight, so I guess this configuration is sufficient to handle this traffic.

From the stats,
there is no upstream_rq_pending_overflow.
We have upstream_rq_pending_failure_eject and upstream_cx_connect_timeout.
I can understand upstream_cx_connect_timeout, since we have a connection timeout of 0.25s.
But what could be the other reasons for upstream_rq_pending_failure_eject?
Also, any suggestions for debugging these 503 issues would be really helpful.

@danielhochman added the "question" label Jul 2, 2018
@bmadhavan
Contributor

Is envoy returning the 503s, or does your service return them?
I have seen 503s returned by envoy when the upstream silently closes an idle connection but envoy is not aware of this event and tries to send a new request on that connection. Configuring a retry_policy at the route level or configuring keepalives for the upstream cluster helped.
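For reference, a minimal sketch of the kind of route-level retry_policy being suggested (v2 route configuration; the virtual host, route match, and cluster name are placeholders):

virtual_hosts:
- name: backend                  # placeholder virtual host
  domains: ["*"]
  routes:
  - match: { prefix: "/" }
    route:
      cluster: target_service    # placeholder cluster name
      retry_policy:
        retry_on: "5xx,connect-failure,reset"
        num_retries: 2
        per_try_timeout: 0.250s

The reset and connect-failure conditions are the ones most relevant to the idle-connection race described above; 5xx additionally retries upstream 5xx responses.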

@dnivra26
Author

dnivra26 commented Jul 3, 2018

Hi @bmadhavan, envoy is returning the 503s

@bwplotka

bwplotka commented Jul 3, 2018

We have a similar question on our side. We do NOT have any circuit breakers set explicitly and we have pretty low traffic; however, we occasionally get a 503 from gRPC requests between two envoys.

My application sees:
rpc error: code = FailedPrecondition desc = transport: received the unexpected content-type "text/plain"

I can see on "sidecar" envoy:

[2018-07-03 15:26:43.991][14][debug][router] source/common/router/router.cc:298] [C151754][S2751502683535461939]   'x-request-id':'ab61454a-cdfd-4b4a-a351-c5f9182484f0'
[2018-07-03 15:26:43.991][14][debug][pool] source/common/http/http2/conn_pool.cc:97] [C151755] creating stream
[2018-07-03 15:26:43.991][14][debug][router] source/common/router/router.cc:966] [C151754][S2751502683535461939] pool ready
[2018-07-03 15:26:43.991][14][debug][http] source/common/http/conn_manager_impl.cc:790] [C151754][S2751502683535461939] request end stream
[2018-07-03 15:26:43.992][14][debug][router] source/common/router/router.cc:590] [C151757][S15483029364081032919] upstream headers complete: end_stream=false
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:963] [C151757][S15483029364081032919] encoding headers via codec (end_stream=false):
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919]   ':status':'503'
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919]   'content-length':'19'
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919]   'content-type':'text/plain'
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919]   'date':'Tue, 03 Jul 2018 15:26:43 GMT'
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919]   'server':'envoy'
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919]   'x-envoy-upstream-service-time':'0'
[2018-07-03 15:26:43.992][14][debug][client] source/common/http/codec_client.cc:94] [C151755] response complete
[2018-07-03 15:26:43.992][14][debug][pool] source/common/http/http2/conn_pool.cc:189] [C151755] destroying stream: 1 remaining
[2018-07-03 15:26:43.992][14][debug][http2] source/common/http/http2/codec_impl.cc:517] [C151757] stream closed: 0
[2018-07-03 15:26:43.992][14][debug][http2] source/common/http/http2/codec_impl.cc:517] [C151755] stream closed: 0

I cannot see anything corresponding on the target envoy (next hop). It looks like either the 503 was returned without any log, or the sidecar envoy responded with 503 immediately?

I cannot see anything in the metrics described here: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/circuit_breaking

...apart from a drop in rate(envoy_cluster_upstream_rq_total[1m]) because of the 1 missed request (my app is doing 2 requests every 5s).

This is interesting, though: 'x-envoy-upstream-service-time':'0'

Any clue what am I hitting here?

I have seen 503s returned by envoy when upstream silently closes an idle connection but envoy is not aware of this event and tries to send a new request on this connection. Configuring retry_policy at the route level or configuring keepalives for the upstream cluster helped.

This might be related...

@bwplotka

bwplotka commented Jul 3, 2018

BTW, is there any way to improve the response? To return an actual gRPC response on such an error (or any other error)?

@bmadhavan
Contributor

bmadhavan commented Jul 3, 2018

@dnivra26 @Bplotka You can enable HTTP access logging to get the response_flags, which typically list the reason for the error code. https://www.envoyproxy.io/docs/envoy/latest/configuration/access_log#format-rules
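For reference, a minimal sketch of a file access log with %RESPONSE_FLAGS% in the format string (v2 http_connection_manager config; the path and the rest of the format are just illustrative):

access_log:
- name: envoy.file_access_log
  config:
    path: /dev/stdout
    format: "[%START_TIME%] \"%REQ(:METHOD)% %REQ(:PATH)%\" %RESPONSE_CODE% %RESPONSE_FLAGS% %UPSTREAM_HOST%\n"

The %RESPONSE_FLAGS% operator is what emits codes like the UC and UF discussed below.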

@dnivra26
Author

dnivra26 commented Jul 4, 2018

@bmadhavan so from the access logs, the response flags for the 503s are

UC: Upstream connection termination in addition to 503 response code

UF: Upstream connection failure in addition to 503 response code

with UC being 75%.

What could be the reasons for connection termination? The traffic is under the limit.

@htuch
Member

htuch commented Jul 4, 2018

Also related:

@bmadhavan
Contributor

@dnivra26 Might be this: #2715
Try turning on keepalives and check if that helps.
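For reference, a minimal sketch of what enabling TCP keepalives on the upstream cluster could look like, assuming an Envoy version that supports upstream_connection_options (the cluster name, address, and keepalive values are placeholders):

clusters:
- name: target_service            # placeholder cluster name
  connect_timeout: 0.25s
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  hosts:
  - socket_address: { address: target-service, port_value: 80 }
  upstream_connection_options:
    tcp_keepalive:
      keepalive_probes: 3
      keepalive_time: 30
      keepalive_interval: 5

Keepalives let the kernel notice connections that the upstream has silently dropped, instead of Envoy discovering it only when the next request is written.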

@vilobhmm

vilobhmm commented Jul 6, 2018

@htuch : ack

@dnivra26
Author

dnivra26 commented Jul 7, 2018

Found these in the envoy logs: "disconnect. resetting 1 pending requests", a bunch of them. Could this be the reason for the 503s? If so, how are these caused and what can be done to fix them? @bmadhavan @mattklein123

[2018-07-07 22:58:34.373][126][debug][client] source/common/http/codec_client.cc:25] [C34258] connecting
[2018-07-07 22:58:34.373][126][debug][client] source/common/http/codec_client.cc:63] [C34258] connected
[2018-07-07 22:58:34.389][132][debug][client] source/common/http/codec_client.cc:94] [C23628] response complete
[2018-07-07 22:58:34.389][126][debug][client] source/common/http/codec_client.cc:94] [C34258] response complete
[2018-07-07 22:58:34.893][130][debug][client] source/common/http/codec_client.cc:94] [C23471] response complete
[2018-07-07 22:58:34.893][132][debug][client] source/common/http/codec_client.cc:94] [C34058] response complete
[2018-07-07 22:58:35.132][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:35.205][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:35.244][132][debug][client] source/common/http/codec_client.cc:94] [C23630] response complete
[2018-07-07 22:58:35.244][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:35.291][132][debug][client] source/common/http/codec_client.cc:94] [C23010] response complete
[2018-07-07 22:58:35.291][132][debug][client] source/common/http/codec_client.cc:94] [C34058] response complete
[2018-07-07 22:58:35.885][132][debug][client] source/common/http/codec_client.cc:81] [C34113] disconnect. resetting 1 pending requests
[2018-07-07 22:58:35.885][132][debug][client] source/common/http/codec_client.cc:104] [C34113] request reset
[2018-07-07 22:58:35.885][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:36.205][132][debug][client] source/common/http/codec_client.cc:94] [C34058] response complete
[2018-07-07 22:58:36.234][132][debug][client] source/common/http/codec_client.cc:94] [C23255] response complete
[2018-07-07 22:58:36.234][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:36.443][132][debug][client] source/common/http/codec_client.cc:81] [C34116] disconnect. resetting 1 pending requests
[2018-07-07 22:58:36.443][132][debug][client] source/common/http/codec_client.cc:104] [C34116] request reset
[2018-07-07 22:58:36.443][132][debug][client] source/common/http/codec_client.cc:94] [C34058] response complete
[2018-07-07 22:58:36.552][132][debug][client] source/common/http/codec_client.cc:81] [C34118] disconnect. resetting 1 pending requests
[2018-07-07 22:58:36.552][132][debug][client] source/common/http/codec_client.cc:104] [C34118] request reset
[2018-07-07 22:58:36.552][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:36.774][132][debug][client] source/common/http/codec_client.cc:94] [C23261] response complete
[2018-07-07 22:58:36.774][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:36.853][130][debug][client] source/common/http/codec_client.cc:94] [C23478] response complete

@dnivra26
Author

dnivra26 commented Jul 7, 2018

More logs: "queueing request due to no available connections".
From the envoy stats, active connections did not even cross ~400, and the default circuit-breaking max connections is 1024. What could be the reasons for envoy saying there are no available connections?

[2018-07-07 23:12:40.396][130][debug][pool] source/common/http/http1/conn_pool.cc:88] [C36833] using existing connection
[2018-07-07 23:12:40.397][130][debug][pool] source/common/http/http1/conn_pool.cc:206] [C36833] response complete
[2018-07-07 23:12:40.397][130][debug][pool] source/common/http/http1/conn_pool.cc:244] [C36833] moving to ready
[2018-07-07 23:12:40.398][132][debug][pool] source/common/http/http1/conn_pool.cc:88] [C35752] using existing connection
[2018-07-07 23:12:40.405][130][debug][pool] source/common/http/http1/conn_pool.cc:79] creating a new connection
[2018-07-07 23:12:40.405][130][debug][pool] source/common/http/http1/conn_pool.cc:105] queueing request due to no available connections
[2018-07-07 23:12:40.406][130][debug][pool] source/common/http/http1/conn_pool.cc:249] [C36950] attaching to next request
[2018-07-07 23:12:40.415][130][debug][pool] source/common/http/http1/conn_pool.cc:206] [C36950] response complete
[2018-07-07 23:12:40.415][130][debug][pool] source/common/http/http1/conn_pool.cc:244] [C36950] moving to ready
[2018-07-07 23:12:40.415][132][debug][pool] source/common/http/http1/conn_pool.cc:206] [C35752] response complete
[2018-07-07 23:12:40.415][132][debug][pool] source/common/http/http1/conn_pool.cc:244] [C35752] moving to ready
[2018-07-07 23:12:40.459][132][debug][pool] source/common/http/http1/conn_pool.cc:88] [C35752] using existing connection

@bmadhavan
Contributor

Envoy does connection pooling to the upstream. So "queueing request due to no available connections" only means that there is no established connection to re-use, and hence Envoy will initiate a new connection to the upstream. This should not cause any issue.
For the 503s, I recommended a few settings above.

@stale

stale bot commented Aug 9, 2018

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

stale bot added the "stale" label Aug 9, 2018
@stale

stale bot commented Aug 16, 2018

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

@vaibhavrtk

I am still facing this issue. Anyone found any solution for this? @dnivra26 @bwplotka

@neufeldtech

neufeldtech commented Jul 12, 2019

For others that may be facing similar symptoms (a low but constant rate of HTTP 503's), be sure to check out the idle_timeout_ms setting introduced in Ambassador 0.60+. https://www.getambassador.io/reference/timeouts/#idle-timeout-idle_timeout_ms

For example, we had a node.js application that appeared to have an idle timeout of 5 seconds, but Ambassador's default idle timeout is 5 minutes. After setting it to 4 seconds (idle_timeout_ms: 4000) for this service, our HTTP 503's went away. Ensure that the proxies "in front of" your services have shorter idle timeouts than the ones behind them.
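For reference, a minimal sketch of what this might look like as an annotation-style Ambassador Mapping, assuming idle_timeout_ms can be set per Mapping as described in the linked docs (the mapping name, prefix, and service are placeholders):

apiVersion: ambassador/v1
kind: Mapping
name: my_service_mapping        # placeholder
prefix: /my-service/
service: my-service
idle_timeout_ms: 4000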

@byustephen

We had a similar issue with our cluster. After redeploying services and deleting pods like cilium, envoy, contour, and kube-proxy, one of our DevOps engineers did a rolling rebuild of the cluster, which resolved the issue.

kops rolling-update cluster --force --yes

This was a mind-boggling issue, where we had intermittent slowness, 503's, queued requests, and more. But after the rolling update we were good to go.
