
Ingress Gateway don't wait for connections to drain and reported http status code 502 error by LoadBalancer #26302

Closed
kazukousen opened this issue Aug 8, 2020 · 5 comments

Comments


kazukousen commented Aug 8, 2020

Bug description
We are running Istio ingress gateway in AWS EKS.
We also have an AWS ALB in front of the ingress gateway, which is exposed via NodePort to back the ALB's target group (instance mode).

(The reason we don't use alb-ingress is that cluster updates can be done more safely without it.)

We have noticed that the ALB returns 502 errors for some requests while an Istio ingress gateway pod is terminating (rolling update, scale-in, etc.),
so we suspect the pod goes down while HTTP keep-alive connections between the ALB and the ingress gateway are still open, and requests in flight on those connections are lost.

We first tried running sleep in a preStop hook, increasing terminationGracePeriodSeconds, setting ISTIO_META_IDLE_TIMEOUT, and increasing meshConfig.drainDuration, but none of these helped.
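For reference, those attempts were variations of roughly the following deployment settings (a minimal sketch rather than our exact manifests; the sleep time and grace period are example values):

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120
      containers:
        - name: istio-proxy
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 30"]   # plain sleep only, no draining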

We finally found a solution:
wait for connections to drain with Envoy's healthcheck/fail admin endpoint.
https://www.envoyproxy.io/docs/envoy/latest/operations/admin#operations-admin-interface-healthcheck-fail
After this endpoint is called, Envoy responds with Connection: close on already-established connections. This is the behavior we were looking for.
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/draining#arch-overview-draining

operator.yaml

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  profile: default
  values:
    gateways:
      istio-ingressgateway:
        type: NodePort
  components:
    ingressGateways:
      - name: istio-ingressgateway
        enabled: true
        k8s:
          overlays:
            - apiVersion: apps/v1
              kind: Deployment
              name: istio-ingressgateway
              patches:
                - path: spec.template.spec.containers.[name:istio-proxy].lifecycle
                  value: {"preStop": {"exec": {"command": ["sh", "-c", "curl -X POST http://localhost:15000/healthcheck/fail && sleep 30"]}}}
                - path: spec.template.spec.terminationGracePeriodSeconds
                  value: 120
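As a rough sanity check of the preStop hook, the admin endpoint can also be hit manually inside a gateway pod (a sketch; <ingressgateway-pod> is a placeholder, and 15000 is the Envoy admin port used above):

# mark Envoy as failed so it starts draining
kubectl -n istio-system exec <ingressgateway-pod> -c istio-proxy -- \
  curl -s -X POST http://localhost:15000/healthcheck/fail
# subsequent HTTP/1.1 responses on existing connections to this pod should now
# carry "Connection: close", so the ALB re-establishes connections to other pods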

This approach seems to be used by Contour as well: https://projectcontour.io/docs/master/redeploy-envoy/

With this solution the 502 errors have been eliminated, but are we missing something important somewhere in the documentation?
If there is a more general solution for Istio users like us, I'd like to know about it. Thank you.

Affected product area

[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[ ] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Steps to reproduce the bug
Run the Istio ingress gateway, put an AWS ALB in front of it, send requests to the application, and redeploy the ingress gateway (e.g. kubectl rollout restart deployment istio-ingressgateway).

Version
AWS EKS 1.17
Istio 1.6.2

Installation
Generated the YAML with istioctl manifest generate -f from the operator YAML above, then applied it with kubectl apply -f.

Environment
AWS EKS

@kazukousen kazukousen changed the title from "Ingress Gateway don't wait for connections to drain and reports http status code 502 error by LoadBalancer" to "Ingress Gateway don't wait for connections to drain and reported http status code 502 error by LoadBalancer" on Aug 8, 2020
@howardjohn
Member

Can you try setting TERMINATION_DRAIN_DURATION_SECONDS=30s as an env var on the ingress gateway? This is basically the built-in draining we do today. It defaults to 5s; possibly that's too small.
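One way to set this (a sketch, assuming the default istio-system namespace and the istio-ingressgateway deployment name; the variable and value are as suggested above):

kubectl -n istio-system set env deployment/istio-ingressgateway \
  TERMINATION_DRAIN_DURATION_SECONDS=30s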

@kazukousen
Author

@howardjohn thank you for your reply.
Hmm... I tried setting the TERMINATION_DRAIN_DURATION_SECONDS=30s env var on istio-ingressgateway, but it did not help.
I found #26524; is it related to this?

@howardjohn
Member

Yes, very likely the same. If you use the code from master, you most likely will see no issues.

@kazukousen
Author

Very nice, I'm looking forward to it. Thanks!

@howardjohn
Member

Closing as this is fixed by #26524 in 1.8. Unfortunately we cannot backport due to dependencies on the Envoy version.
