
Set matching cache status http code #330

Open · wants to merge 2 commits into master
Conversation

@ruial commented Aug 30, 2022

This endpoint always returns status 200, even if one or more caches are not alive. With this change, if any cache is dead, it returns status 500, which health checking tools can then use to restart the service.

I had previously monkey patched #292 into the service I maintain, but I still found cases where one or more caches were not restarted correctly. This is hard to reproduce and happens from time to time. With this change, I no longer need to restart the Metaflow service manually.
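Reduced to its essence, the status-code logic this PR proposes looks roughly like the sketch below (the function name and the list-of-flags input are illustrative, not the service's actual internals):

```python
def cache_status_code(cache_alive_flags):
    """Return 200 only when every cache reports alive, otherwise 500."""
    # Before this change the endpoint always answered 200; health-check
    # tools (ELB, Kubernetes) treat any status >= 400 as unhealthy, so a
    # 500 here lets them restart the service automatically.
    return 200 if all(cache_alive_flags) else 500
```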

@jackie-ob (Contributor) commented Aug 30, 2022

It feels a little janky to return a 500 status code purely as a function of the response content here. Could the health check tool use other information on the response instead? E.g.

  • Health check tool inspects the body and does the all(is_alive) check itself.
  • Metadata service can set an all_caches_alive flag either in response body, or in response header.
  • A new endpoint made to indicate the need to restart. We have /ping for this purpose today. However we can consider another endpoint with more sophisticated internal criteria.
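For illustration, the second alternative above could look something like this sketch (the function name and response shape are assumptions, not actual service code):

```python
import json

def build_status_response(cache_alive_flags):
    # Sketch of the flag-in-body alternative: the endpoint keeps its 200
    # status and instead exposes an explicit all_caches_alive field (it
    # could equally go in a response header) for the health-check tool
    # to inspect.
    body = {
        "caches": list(cache_alive_flags),
        "all_caches_alive": all(cache_alive_flags),
    }
    return 200, json.dumps(body)
```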

@ruial (Author) commented Aug 30, 2022

Hey @jackie-ob, setting an HTTP status code >= 400 is the standard way to signal that a service is unhealthy; this is how health checks work in AWS ELB and Kubernetes.

In the case of ELB I don't think the response body can be evaluated to determine the service health.

You guys are also using this technique correctly in other places, see https://github.com/Netflix/metaflow-service/blob/master/services/metadata_service/api/admin.py#L59

Please let me know if this MR makes sense or if I should change something.

@romain-intel (Contributor) commented

What issues were you seeing in the cache? I believe they may be addressed in #327 which should keep the caches alive.

@jackie-ob (Contributor) commented

@ruial we are planning to ship #327 today, as @romain-intel mentioned. Would you like to see if that makes the need for this PR go away altogether?

If it is still a problem, our preference would be to create a new endpoint (similar to what you referenced).
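A dedicated endpoint along those lines might look like the following sketch; the name, signature, and health criteria are hypothetical, merely mirroring the shape of the existing /ping in services/metadata_service/api/admin.py:

```python
def cache_ping(cache_alive_flags):
    # Hypothetical dedicated health endpoint (illustrative only, not from
    # the PR): answer like the existing /ping, but only report healthy
    # when every cache worker is alive. More sophisticated internal
    # criteria (e.g. database health) could be folded in here.
    if all(cache_alive_flags):
        return 200, "pong"
    return 503, "unhealthy"
```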

@ruial (Author) commented Aug 30, 2022

Sure, I can try the new version. I think the fix from https://github.com/Netflix/metaflow-service/pull/292/files#diff-17bee700a8b2eb91cc38dcb3d7e30cd88769966fdb21c724393a5746f664d3cf is not included, though. A new endpoint in the future for health checking the cache (and maybe the database) would also work. The current issue I have is that all the cache worker subprocesses get terminated and don't restart, so everything in the UI gets a CacheServerUnreachable exception. I don't know how to reproduce the issue as it happens rarely.

Everything else with Metaflow works well; the only issues we currently have are with the cache. Thank you :)
