Custom path for health check (eg. /healthz) #2938

Closed
Maxwell2022 opened this issue Apr 13, 2023 · 6 comments · Fixed by #4145
@Maxwell2022

Is your feature request related to a problem? Please describe.
I want to be able to set a custom path for the health check endpoint.

Describe the solution you'd like

health_check:
  listen: 0.0.0.0:4000
  enabled: true
  path: /healthz

Describe alternatives you've considered
Update our Kubernetes configuration

Additional context
I'm migrating from @apollo/gateway to Apollo Router. We should be able to deploy our new gateway (the router) without changing the configuration of its corresponding service in Kubernetes. We adopted a GCP standard across all our services of probing the /healthz endpoint. Unfortunately, Apollo Router hardcodes this endpoint to /health, and therefore we cannot deploy with zero downtime.
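
For reference, the probe configuration we standardize on looks roughly like the following (the port matches the health_check listen address above; treat the exact values as illustrative rather than our literal manifests):

readinessProbe:
  httpGet:
    path: /healthz
    port: 4000
livenessProbe:
  httpGet:
    path: /healthz
    port: 4000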

@abernix
Member

abernix commented Apr 13, 2023

Unfortunately, Apollo Router hardcodes this endpoint to /health, and therefore we cannot deploy with zero downtime.

I'm not sure I understand why there would be an inability to deploy with zero downtime on account of the health check endpoint path. Could you elaborate on that?

@Maxwell2022
Author

Maxwell2022 commented Apr 13, 2023

Our pipeline deploys services by updating the Kubernetes Deployment configuration for the service in question, bumping the image version it should use (that's the standard way to deploy Kubernetes services).

If we deploy this service with a new health check path, the rollout will fail and revert to the previous version (because the new pods will be considered unhealthy).

If we update our Kubernetes configuration before we deploy the service, then it would fail the health check for our current gateway (which listens on /healthz).

We would have to deploy and update the Kubernetes configuration at the same time, and even that would be a problem, because the old pods would start failing the health check while the new pods come up. For reference, it takes around 1 to 2 minutes for AWS EKS to register a pod. I cannot see this happening without downtime.

Ideally, we could configure this health check path so that there is no breaking change when migrating from @apollo/gateway to Apollo Router.

@garypen
Contributor

garypen commented May 3, 2023

Hi @Maxwell2022. I wanted to be as close as I could be to 100% certain that changing the liveness/readiness configuration didn't behave as you described, so I set up an experiment in a Kubernetes cluster.

I created a custom router image with a health endpoint of /healthz. I then deployed this into my test cluster and observed it working as expected, with both liveness and readiness checks running against this custom router.
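
For reference, the probes in my test deployment were along these lines; only the path changed (to /health) between the two steps. Treat the port as illustrative (8088 is the router's default health check port):

readinessProbe:
  httpGet:
    path: /healthz
    port: 8088
livenessProbe:
  httpGet:
    path: /healthz
    port: 8088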

I then modified the deployment to use a standard router image (1.13.1, but any recent version would be fine) with a health endpoint of /health, and updated the liveness and readiness checks from /healthz to /health. Here's what happened to the pods, which I watched with the -w flag:

router-5c78545b7b-p545s   1/1     Running   0          85s
router-bf6ddccd8-vz2lb    0/1     Pending   0          0s
router-bf6ddccd8-vz2lb    0/1     Pending   0          0s
router-bf6ddccd8-vz2lb    0/1     ContainerCreating   0          0s
router-bf6ddccd8-vz2lb    0/1     Running             0          8s
router-bf6ddccd8-vz2lb    1/1     Running             0          8s
router-5c78545b7b-p545s   1/1     Terminating         0          6m23s
router-5c78545b7b-p545s   0/1     Terminating         0          6m24s
router-5c78545b7b-p545s   0/1     Terminating         0          6m24s
router-5c78545b7b-p545s   0/1     Terminating         0          6m24s

You'll notice that the new (/health) pod (vz2lb) has to start and transition to running before the old (/healthz) pod (p545s) is shut down. Whilst this transition was ongoing, I had another terminal making queries against the router service, and there was no interruption in service.

This is because, for a Deployment with the default update strategy of RollingUpdate, old pods continue to execute with the existing configuration until replacement pods (with the new configuration) transition to running. The Kubernetes documentation on Deployments gives a good description of how this works.
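
For completeness, the defaults at play here are roughly equivalent to spelling this out in the Deployment (a sketch of the default RollingUpdate parameters):

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # extra pods allowed above the desired count during the update
      maxUnavailable: 25%  # pods allowed to be unavailable during the update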

Summary: You should be able to update your Kubernetes configuration with confidence that it will work with no interruption in service, unless there is some other reason preventing this.

Note: I'm assuming your deployment uses the default RollingUpdate strategy and not Recreate. You don't mention which you use, and Recreate isn't widely used, but it would lead to an interruption in service on changing any aspect of your deployment configuration.

@Maxwell2022
Author

Hey @garypen, thanks for your message.

I think there are a few more things to consider here. The first is that the watch output you shared covers deploying a new container, but what happens if you change the readiness probe from /healthz to /health without deploying anything just yet? I'm pretty sure all pods would be flagged as "unhealthy" and the service would stop responding.

In your example it takes 8 seconds to pull and start the new container, which to me is unrealistic; we are closer to 120 seconds on AWS EKS using Fargate (and we are using RollingUpdate too).

Now, I'm unsure what the complexity is or why there is pushback on making this health check path configurable. I know that in our case it would help to keep our services consistent in how we deploy and smoke-test them in CI. At the moment we have a bunch of conditions just for Apollo Router, which is not ideal.

That said if there are good reasons then I'm happy to close this issue.

@garypen
Contributor

garypen commented May 4, 2023

Hi @Maxwell2022

There is no pushback against making the health check configurable. I'm just trying to make clear, for the benefit of anyone following this issue, that there is no technical requirement for this in a Kubernetes deployment of the router replacing an existing gateway deployment.

Regarding changing the readiness probe without changing the container (i.e. the implementation): that is clearly a misconfiguration and won't work. It won't affect any of your existing deployed pods (assuming the update strategy is RollingUpdate). That's not specific to a readiness check; it's just misconfiguring a deployment.

Regarding the amount of time it takes to create new pods: that doesn't affect the update, so it's also not a factor. The update will simply take longer, but the existing pods (with the existing configuration) will continue to operate until the new pods (with the new configuration) become ready.

We'd be more than happy to take a PR that allows configuring the endpoint of the health check. We'd want the current (fixed) endpoint to remain the default, and the implementation would need to ensure that there is no clash with other configurable endpoints.
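
To sketch what that might look like, the existing health_check section could grow a path option (hypothetical until implemented; the listen address shown is the current default, and /health would remain the default path):

health_check:
  listen: 127.0.0.1:8088
  enabled: true
  path: /healthz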

@Maxwell2022
Author

I'm just trying to make clear, for the benefit of anyone following this issue, that there is no technical requirement for this in a Kubernetes deployment of the router replacing an existing gateway deployment.

True 💯, this was just our concern at the time. Thanks for clarifying this.

We'd be more than happy to take a PR that allows configuring the endpoint of the health check

Might be an opportunity to jump on the crablang bandwagon. I cannot promise anything, though.

Thanks again Gary

aaronArinder pushed a commit to aaronArinder/router that referenced this issue Nov 4, 2023
resolves apollographql#2938

Adds a configuration option for custom health check endpoints, using
`/health` as default