
Ingress Gateway don't wait for connections to drain and reported http status code 502 error by LoadBalancer #26302

Closed
kazukousen opened this issue Aug 8, 2020 · 5 comments

Comments


kazukousen commented Aug 8, 2020

Bug description
We are running Istio ingress gateway in AWS EKS.
We also have an AWS ALB in front of the ingress gateway, which is exposed via NodePort to back the ALB's target group (instance mode).

(The reason we don't use alb-ingress is that cluster updates can be done more safely without it.)

We have noticed that the ALB returns 502 errors for some requests while an Istio ingress gateway pod is terminating (rolling update, scale-in, etc.),
so we suspect the pod goes down while HTTP keep-alive connections between the ALB and the ingress gateway are still open, and requests in flight on those connections are lost.

We first tried running sleep in a preStop hook, increasing terminationGracePeriodSeconds, setting ISTIO_META_IDLE_TIMEOUT, and increasing meshConfig.drainDuration, but none of these helped.
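For reference, those attempts were variations of roughly the following deployment settings (a minimal sketch rather than our exact manifests; the sleep time and grace period are example values):

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120
      containers:
        - name: istio-proxy
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 30"]   # plain sleep only, no draining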

We finally found a solution:
wait for connections to drain with Envoy's healthcheck/fail admin endpoint.
https://www.envoyproxy.io/docs/envoy/latest/operations/admin#operations-admin-interface-healthcheck-fail
After this endpoint is called, Envoy responds with Connection: close on already-established connections. This is the behavior we were looking for.
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/draining#arch-overview-draining

operator.yaml

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  profile: default
  values:
    gateways:
      istio-ingressgateway:
        type: NodePort
  components:
    ingressGateways:
      - name: istio-ingressgateway
        enabled: true
        k8s:
          overlays:
            - apiVersion: apps/v1
              kind: Deployment
              name: istio-ingressgateway
              patches:
                - path: spec.template.spec.containers.[name:istio-proxy].lifecycle
                  value: {"preStop": {"exec": {"command": ["sh", "-c", "curl -X POST http://localhost:15000/healthcheck/fail && sleep 30"]}}}
                - path: spec.template.spec.terminationGracePeriodSeconds
                  value: 120
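As a rough sanity check of the preStop hook, the admin endpoint can also be hit manually inside a gateway pod (a sketch; <ingressgateway-pod> is a placeholder, and 15000 is the Envoy admin port used above):

# mark Envoy as failed so it starts draining
kubectl -n istio-system exec <ingressgateway-pod> -c istio-proxy -- \
  curl -s -X POST http://localhost:15000/healthcheck/fail
# subsequent HTTP/1.1 responses on existing connections to this pod should now
# carry "Connection: close", so the ALB re-establishes connections to other pods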

This approach seems to be used by Contour as well: https://projectcontour.io/docs/master/redeploy-envoy/

With this solution the 502 errors have been eliminated, but are we missing something important somewhere in the documentation?
If there is a more general solution for Istio users like us, I'd like to know about it. Thank you.

Affected product area

[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[ ] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Steps to reproduce the bug
Run the Istio ingress gateway, put an AWS ALB in front of it, send requests to the application, and redeploy the ingress gateway (e.g. kubectl rollout restart deployment istio-ingressgateway).

Version
AWS EKS 1.17
Istio 1.6.2

Installation
Generated the YAML with istioctl manifest generate -f from the operator YAML above, then applied it with kubectl apply -f.

Environment
AWS EKS

@kazukousen kazukousen changed the title from "Ingress Gateway don't wait for connections to drain and reports http status code 502 error by LoadBalancer" to "Ingress Gateway don't wait for connections to drain and reported http status code 502 error by LoadBalancer" on Aug 8, 2020
@howardjohn
Member

Can you try setting TERMINATION_DRAIN_DURATION_SECONDS=30s as an env var on the ingress gateway? This is basically the built-in draining we do today. It defaults to 5s; possibly that's too small.
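One way to set this (a sketch, assuming the default istio-system namespace and the istio-ingressgateway deployment name; the variable and value are as suggested above):

kubectl -n istio-system set env deployment/istio-ingressgateway \
  TERMINATION_DRAIN_DURATION_SECONDS=30s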

@kazukousen
Author

@howardjohn thank you for your reply.
Hmm... I tried setting the TERMINATION_DRAIN_DURATION_SECONDS=30s env var on istio-ingressgateway, but it did not help.
I found #26524; is it related to this?

@howardjohn
Member

Yes, very likely the same. If you use the code from master, you most likely will see no issues.

@kazukousen
Author

Very nice, I'm looking forward to it. Thanks!

@howardjohn
Member

Closing as this is fixed by #26524 in 1.8. Unfortunately we cannot backport due to dependencies on the Envoy version.
