
Debugging 503s #3771

Closed
dnivra26 opened this issue Jul 2, 2018 · 17 comments
Labels
question (Questions that are neither investigations, bugs, nor enhancements)
stale (stalebot believes this issue/PR has not been touched recently)

Comments

@dnivra26

dnivra26 commented Jul 2, 2018

Title: Debugging 503s

Description:
We have a service which is returning 503s for about 1% of the total traffic.
The target service has 100 replicas and the calling service has 50.
We have updated the circuit breaker configuration as below:

circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 10000
    max_pending_requests: 10000
    max_requests: 10000
    max_retries: 3

The traffic would be around 10,000 requests/second and the latency is around 200ms, which works out to roughly 10,000 × 0.2 ≈ 2,000 concurrent requests in flight, so I guess this configuration is sufficient to handle this traffic.

From the stats,
there is no upstream_rq_pending_overflow.
We have upstream_rq_pending_failure_eject and upstream_cx_connect_timeout.
I can understand upstream_cx_connect_timeout, since we have a connection timeout of 0.25s.
But what could be the other reasons for upstream_rq_pending_failure_eject?
Also, any suggestions for debugging these 503 issues would be really helpful.

@danielhochman added the "question" label Jul 2, 2018
@bmadhavan
Contributor

Is envoy returning the 503s, or does your service return them?
I have seen 503s returned by envoy when the upstream silently closes an idle connection but envoy is not aware of this event and tries to send a new request on that connection. Configuring a retry_policy at the route level or configuring keepalives for the upstream cluster helped.
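For reference, a minimal sketch of the kind of route-level retry_policy being suggested (v2 route configuration; the virtual host, route match, and cluster name are placeholders):

virtual_hosts:
- name: backend                  # placeholder virtual host
  domains: ["*"]
  routes:
  - match: { prefix: "/" }
    route:
      cluster: target_service    # placeholder cluster name
      retry_policy:
        retry_on: "5xx,connect-failure,reset"
        num_retries: 2
        per_try_timeout: 0.250s

The reset and connect-failure conditions are the ones most relevant to the idle-connection race described above; 5xx additionally retries upstream 5xx responses.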

@dnivra26
Author

dnivra26 commented Jul 3, 2018

Hi @bmadhavan, envoy is returning the 503s

@bwplotka

bwplotka commented Jul 3, 2018

We have a similar question on our side. We do NOT have any circuit breakers set explicitly and we have pretty low traffic; however, we occasionally get a 503 from gRPC requests between two envoys.

My application sees:
rpc error: code = FailedPrecondition desc = transport: received the unexpected content-type "text/plain"

I can see on "sidecar" envoy:

[2018-07-03 15:26:43.991][14][debug][router] source/common/router/router.cc:298] [C151754][S2751502683535461939]   'x-request-id':'ab61454a-cdfd-4b4a-a351-c5f9182484f0'
[2018-07-03 15:26:43.991][14][debug][pool] source/common/http/http2/conn_pool.cc:97] [C151755] creating stream
[2018-07-03 15:26:43.991][14][debug][router] source/common/router/router.cc:966] [C151754][S2751502683535461939] pool ready
[2018-07-03 15:26:43.991][14][debug][http] source/common/http/conn_manager_impl.cc:790] [C151754][S2751502683535461939] request end stream
[2018-07-03 15:26:43.992][14][debug][router] source/common/router/router.cc:590] [C151757][S15483029364081032919] upstream headers complete: end_stream=false
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:963] [C151757][S15483029364081032919] encoding headers via codec (end_stream=false):
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919]   ':status':'503'
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919]   'content-length':'19'
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919]   'content-type':'text/plain'
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919]   'date':'Tue, 03 Jul 2018 15:26:43 GMT'
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919]   'server':'envoy'
[2018-07-03 15:26:43.992][14][debug][http] source/common/http/conn_manager_impl.cc:968] [C151757][S15483029364081032919]   'x-envoy-upstream-service-time':'0'
[2018-07-03 15:26:43.992][14][debug][client] source/common/http/codec_client.cc:94] [C151755] response complete
[2018-07-03 15:26:43.992][14][debug][pool] source/common/http/http2/conn_pool.cc:189] [C151755] destroying stream: 1 remaining
[2018-07-03 15:26:43.992][14][debug][http2] source/common/http/http2/codec_impl.cc:517] [C151757] stream closed: 0
[2018-07-03 15:26:43.992][14][debug][http2] source/common/http/http2/codec_impl.cc:517] [C151755] stream closed: 0

I cannot see anything corresponding on the target envoy (next hop). It looks like either the 503 was returned without any log, or the sidecar envoy responded with 503 immediately?

I cannot see anything in the metrics described here: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/circuit_breaking

...apart from a drop in rate(envoy_cluster_upstream_rq_total[1m]) because of the 1 missed request (my app is doing 2 requests every 5s).

This is interesting, though: 'x-envoy-upstream-service-time':'0'

Any clue what am I hitting here?

I have seen 503s returned by envoy when upstream silently closes an idle connection but envoy is not aware of this event and tries to send a new request on this connection. Configuring retry_policy at the route level or configuring keepalives for the upstream cluster helped.

This might be related...

@bwplotka

bwplotka commented Jul 3, 2018

BTW, is there any way to improve the response? To return an actual gRPC response on such an error (or any other error)?

@bmadhavan
Contributor

bmadhavan commented Jul 3, 2018

@dnivra26 @Bplotka You can enable HTTP access logging to get the response_flags, which typically list the reason for the error code. https://www.envoyproxy.io/docs/envoy/latest/configuration/access_log#format-rules
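For reference, a minimal sketch of a file access log with %RESPONSE_FLAGS% in the format string (v2 http_connection_manager config; the path and the rest of the format are just illustrative):

access_log:
- name: envoy.file_access_log
  config:
    path: /dev/stdout
    format: "[%START_TIME%] \"%REQ(:METHOD)% %REQ(:PATH)%\" %RESPONSE_CODE% %RESPONSE_FLAGS% %UPSTREAM_HOST%\n"

The %RESPONSE_FLAGS% operator is what emits codes like the UC and UF discussed below.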

@dnivra26
Author

dnivra26 commented Jul 4, 2018

@bmadhavan so from the access logs, the response flags for the 503s are

UC: Upstream connection termination in addition to 503 response code

UF: Upstream connection failure in addition to 503 response code

with UC being 75%.

What could be the reasons for connection termination? The traffic is under the limit.

@htuch
Member

htuch commented Jul 4, 2018

Also related:

@bmadhavan
Contributor

@dnivra26 Might be this: #2715
Try turning on keepalives and check if that helps.
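For reference, a minimal sketch of what enabling TCP keepalives on the upstream cluster could look like, assuming an Envoy version that supports upstream_connection_options (the cluster name, address, and keepalive values are placeholders):

clusters:
- name: target_service            # placeholder cluster name
  connect_timeout: 0.25s
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  hosts:
  - socket_address: { address: target-service, port_value: 80 }
  upstream_connection_options:
    tcp_keepalive:
      keepalive_probes: 3
      keepalive_time: 30
      keepalive_interval: 5

Keepalives let the kernel notice connections that the upstream has silently dropped, instead of Envoy discovering it only when the next request is written.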

@vilobhmm

vilobhmm commented Jul 6, 2018

@htuch : ack

@dnivra26
Author

dnivra26 commented Jul 7, 2018

Found these in the envoy logs: "disconnect. resetting 1 pending requests", a bunch of them. Could this be the reason for the 503s? If so, how are these caused and what can be done to fix them? @bmadhavan @mattklein123

[2018-07-07 22:58:34.373][126][debug][client] source/common/http/codec_client.cc:25] [C34258] connecting
[2018-07-07 22:58:34.373][126][debug][client] source/common/http/codec_client.cc:63] [C34258] connected
[2018-07-07 22:58:34.389][132][debug][client] source/common/http/codec_client.cc:94] [C23628] response complete
[2018-07-07 22:58:34.389][126][debug][client] source/common/http/codec_client.cc:94] [C34258] response complete
[2018-07-07 22:58:34.893][130][debug][client] source/common/http/codec_client.cc:94] [C23471] response complete
[2018-07-07 22:58:34.893][132][debug][client] source/common/http/codec_client.cc:94] [C34058] response complete
[2018-07-07 22:58:35.132][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:35.205][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:35.244][132][debug][client] source/common/http/codec_client.cc:94] [C23630] response complete
[2018-07-07 22:58:35.244][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:35.291][132][debug][client] source/common/http/codec_client.cc:94] [C23010] response complete
[2018-07-07 22:58:35.291][132][debug][client] source/common/http/codec_client.cc:94] [C34058] response complete
[2018-07-07 22:58:35.885][132][debug][client] source/common/http/codec_client.cc:81] [C34113] disconnect. resetting 1 pending requests
[2018-07-07 22:58:35.885][132][debug][client] source/common/http/codec_client.cc:104] [C34113] request reset
[2018-07-07 22:58:35.885][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:36.205][132][debug][client] source/common/http/codec_client.cc:94] [C34058] response complete
[2018-07-07 22:58:36.234][132][debug][client] source/common/http/codec_client.cc:94] [C23255] response complete
[2018-07-07 22:58:36.234][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:36.443][132][debug][client] source/common/http/codec_client.cc:81] [C34116] disconnect. resetting 1 pending requests
[2018-07-07 22:58:36.443][132][debug][client] source/common/http/codec_client.cc:104] [C34116] request reset
[2018-07-07 22:58:36.443][132][debug][client] source/common/http/codec_client.cc:94] [C34058] response complete
[2018-07-07 22:58:36.552][132][debug][client] source/common/http/codec_client.cc:81] [C34118] disconnect. resetting 1 pending requests
[2018-07-07 22:58:36.552][132][debug][client] source/common/http/codec_client.cc:104] [C34118] request reset
[2018-07-07 22:58:36.552][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:36.774][132][debug][client] source/common/http/codec_client.cc:94] [C23261] response complete
[2018-07-07 22:58:36.774][130][debug][client] source/common/http/codec_client.cc:94] [C33487] response complete
[2018-07-07 22:58:36.853][130][debug][client] source/common/http/codec_client.cc:94] [C23478] response complete

@dnivra26
Author

dnivra26 commented Jul 7, 2018

More logs: "queueing request due to no available connections".
From the envoy stats, active connections did not even cross ~400, and the default circuit-breaking max connections is 1024. What could be the reasons for envoy saying there are no available connections?

[2018-07-07 23:12:40.396][130][debug][pool] source/common/http/http1/conn_pool.cc:88] [C36833] using existing connection
[2018-07-07 23:12:40.397][130][debug][pool] source/common/http/http1/conn_pool.cc:206] [C36833] response complete
[2018-07-07 23:12:40.397][130][debug][pool] source/common/http/http1/conn_pool.cc:244] [C36833] moving to ready
[2018-07-07 23:12:40.398][132][debug][pool] source/common/http/http1/conn_pool.cc:88] [C35752] using existing connection
[2018-07-07 23:12:40.405][130][debug][pool] source/common/http/http1/conn_pool.cc:79] creating a new connection
[2018-07-07 23:12:40.405][130][debug][pool] source/common/http/http1/conn_pool.cc:105] queueing request due to no available connections
[2018-07-07 23:12:40.406][130][debug][pool] source/common/http/http1/conn_pool.cc:249] [C36950] attaching to next request
[2018-07-07 23:12:40.415][130][debug][pool] source/common/http/http1/conn_pool.cc:206] [C36950] response complete
[2018-07-07 23:12:40.415][130][debug][pool] source/common/http/http1/conn_pool.cc:244] [C36950] moving to ready
[2018-07-07 23:12:40.415][132][debug][pool] source/common/http/http1/conn_pool.cc:206] [C35752] response complete
[2018-07-07 23:12:40.415][132][debug][pool] source/common/http/http1/conn_pool.cc:244] [C35752] moving to ready
[2018-07-07 23:12:40.459][132][debug][pool] source/common/http/http1/conn_pool.cc:88] [C35752] using existing connection

@bmadhavan
Contributor

Envoy does connection pooling to the upstream. So "queueing request due to no available connections" only means that there is no established connection to re-use, and hence Envoy will initiate a new connection to the upstream. This should not cause any issue.
For the 503s, I recommended a few settings above.

@stale

stale bot commented Aug 9, 2018

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

stale bot added the "stale" label Aug 9, 2018
@stale

stale bot commented Aug 16, 2018

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

@vaibhavrtk

I am still facing this issue. Anyone found any solution for this? @dnivra26 @bwplotka

@neufeldtech

neufeldtech commented Jul 12, 2019

For others that may be facing similar symptoms (a low but constant rate of HTTP 503's), be sure to check out the idle_timeout_ms setting introduced in Ambassador 0.60+. https://www.getambassador.io/reference/timeouts/#idle-timeout-idle_timeout_ms

For example, we had a node.js application that appeared to have an idle timeout of 5 seconds, but Ambassador's default idle timeout is 5 minutes. After setting it to 4 seconds (idle_timeout_ms: 4000) for this service, our HTTP 503's went away. Ensure that the proxies "in front of" your services have shorter idle timeouts than the ones behind them.
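For reference, a minimal sketch of what this might look like as an annotation-style Ambassador Mapping, assuming idle_timeout_ms can be set per Mapping as described in the linked docs (the mapping name, prefix, and service are placeholders):

apiVersion: ambassador/v1
kind: Mapping
name: my_service_mapping        # placeholder
prefix: /my-service/
service: my-service
idle_timeout_ms: 4000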

@byustephen

We had a similar issue with our cluster. After redeploying services and deleting pods like cilium, envoy, contour, and kube-proxy, one of our DevOps engineers did a rolling rebuild of the cluster, which resolved the issue.

kops rolling-update cluster --force --yes

This was a mind-boggling issue, where we had intermittent slowness, 503's, queued requests, and more. But after the rolling update we were good to go.
