/healthcheck endpoint should check for Elasticsearch availability (original #487) #14

obulat · 2021-04-21T12:14:04Z

This issue has been migrated from the CC Search API repository

Author: aldenstpage
Date: Wed May 06 2020
Labels: ✨ goal: improvement, 🏷 status: label work required, 🙅 status: discontinued

During deployments, our load balancer repeatedly polls the /healthcheck endpoint to check that the server is reachable. If this check succeeds, the newly deployed instance starts receiving production traffic. Right now, if Elasticsearch is not responsive, /healthcheck will still return 200 OK.

The healthcheck endpoint should check the health of the image index in Elasticsearch using the cluster health API. If it is unavailable, return error 500. Log an informative message explaining why the healthcheck failed.

Because the healthcheck endpoint may be called many times, and Elasticsearch calls are not free, we should cache the response of Elasticsearch for up to 10 seconds per call.
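
The behaviour requested above can be sketched as follows. This is a minimal illustration, not the project's implementation: `CachedHealthcheck`, the injected `probe` callable, and the in-process cache are all hypothetical names and choices (the thread later mentions Redis for caching), and treating only `green` as healthy is an assumption that can be changed through `healthy_statuses`.

```python
import logging
import time

logger = logging.getLogger(__name__)

# From the issue: reuse the Elasticsearch response for up to 10 seconds.
CACHE_TTL_SECONDS = 10


class CachedHealthcheck:
    """Wrap a health probe and reuse its result for a short time.

    `probe` is a hypothetical callable that returns a dict shaped like the
    cluster health response, e.g. {"status": "green"}.
    """

    def __init__(self, probe, ttl=CACHE_TTL_SECONDS,
                 healthy_statuses=("green",), clock=time.monotonic):
        self._probe = probe
        self._ttl = ttl
        self._healthy_statuses = healthy_statuses  # assumption: green only
        self._clock = clock
        self._cached_at = None
        self._cached_code = None

    def status_code(self):
        """Return 200 if the index looks healthy, otherwise 500."""
        now = self._clock()
        if self._cached_at is not None and now - self._cached_at < self._ttl:
            return self._cached_code  # still fresh, skip the Elasticsearch call

        try:
            status = self._probe().get("status")
            healthy = status in self._healthy_statuses
            if not healthy:
                logger.error("Healthcheck failed: image index status is %r", status)
        except Exception as exc:  # any failure to reach Elasticsearch counts as unhealthy
            logger.error("Healthcheck failed: could not query Elasticsearch: %s", exc)
            healthy = False

        self._cached_code = 200 if healthy else 500
        self._cached_at = now
        return self._cached_code
```

Passing the probe in as a callable keeps the caching and logging logic independent of how the Elasticsearch connection is obtained, so it can be exercised without a running cluster.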

Original Comments:

madewithkode commented on Fri May 08 2020:

Hi Alden, this looks interesting, I'd love to work on it.

source

madewithkode commented on Fri May 08 2020:

Hi Alden, in order to check the health of the image index in the /healthcheck view, I'm trying to use urllib's urlopen() method to make a request to Elasticsearch's cluster API this way:

cluster_response = urlopen('http://0.0.0.0:8000/_cluster/health/image')

However, I keep getting a 404. Is there something I'm doing wrong?
source

madewithkode commented on Fri May 08 2020:

Hi Alden, in order to check the health of the image index in the /healthcheck view, I'm trying to use urllib's urlopen() method to make a request to Elasticsearch's cluster API this way:

cluster_response = urlopen('http://0.0.0.0:8000/_cluster/health/image')

However, I keep getting a 404. Is there something I'm doing wrong?

Figured this out; I didn't know Elasticsearch was running on a separate host/port :)
source

aldenstpage commented on Fri May 08 2020:

That's great!

It would be best to use the equivalent elasticsearch-py or elasticsearch-dsl query instead of making direct calls to the REST API (you can get an instance of the connection to Elasticsearch from search_controller.py). Here's an example for getting the cluster health; there ought to also be a way to narrow the query to the image index.
source
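
A minimal sketch of the suggestion above, assuming `es` is an elasticsearch-py client instance such as the one created in search_controller.py. The helper name `image_index_status` is hypothetical; the client's `cluster.health()` call accepts an `index` argument that narrows the query to a single index.

```python
def image_index_status(es, index="image"):
    """Return the health status string ("green", "yellow" or "red") for one index.

    `es` is assumed to be an elasticsearch-py client; only its
    `cluster.health()` method is used here.
    """
    response = es.cluster.health(index=index)
    return response["status"]
```

Going through the client rather than `urlopen()` reuses the host, port, and authentication settings that the existing connection already carries.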

madewithkode commented on Sat May 09 2020:

Alright, I'll look at the suggestion.

On Fri, May 8, 2020, 21:06 Alden S Page wrote:

It would be best to use the equivalent elasticsearch-py query instead of making direct calls to the REST API. Here's an example for getting the cluster health (https://discuss.elastic.co/t/how-to-get-cluster-health-using-python-api/25431); there ought to also be a way to narrow the query to the image index.

source

madewithkode commented on Sat May 09 2020:

Hi Alden, I'm here again :)
I'd like to ask: which status specifically signifies the availability of the image index: red, yellow, or green? Or should I leave this detail out of the query, since I'm using an already established connection instance from search_controller.py, which waits for a yellow status by default?
source

madewithkode commented on Sat May 09 2020:

Update:

I've successfully managed to query the health of the entire cluster using the Elasticsearch connection instance obtained from search_controller.py. However, when I try to limit the health check to just the image index, the request never resolves and keeps running with no response. And when I try to specify a timeout for the request, I get an "Illegal argument exception", even though timeout is a valid kwarg referenced in the API docs.

I should point out that, as of the time of writing, I have yet to successfully run ./load_sample_data.sh, so I don't know whether this is linked to the above problem.

source
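
One possible cause of the timeout error described above, not confirmed anywhere in this thread: the cluster health API expects `timeout` as a time value with a unit (for example `"5s"`), and recent Elasticsearch versions generally reject a bare number. The sketch below assumes `es` is an elasticsearch-py client; the helper name `index_health_with_timeout` is hypothetical.

```python
def index_health_with_timeout(es, index="image", seconds=5):
    """Query one index's health with a server-side timeout.

    Assumption: the error came from a unitless timeout value. The value is
    formatted as a time string such as "5s" before being sent.
    """
    return es.cluster.health(index=index, timeout=f"{seconds}s")
```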

madewithkode commented on Mon May 11 2020:

Hi Alden, Progress Report :)

I successfully got load_sample_data.sh to run, and so far everything else is working fine.
I've also set up the 10s response caching on the /healthcheck view using Redis, as well as the error logging.

However, I figured out why querying the Elasticsearch image index was unresponsive: the index did not exist, and the whole cluster was empty too.

Do I need to do a manual population or something?

source

aldenstpage commented on Mon May 11 2020:

Hi again Onyenanu – if the index doesn't exist, the healthcheck should fail. This could happen in situations where we are switching Elasticsearch clusters in production and forgot to index data into the new one (or something went wrong while we were loading data into the new cluster).

In my experience, the ES Python libs can behave in unexpected ways that you sometimes have to work around. Since it seems like querying specifically for the image index health hangs when the index doesn't exist, perhaps you could query for healthchecks of every index in the cluster, and fail the healthcheck if image is not among them and green?

It sounds like it's coming along nicely!
source
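
The workaround suggested above can be sketched like this. It assumes the health of every index is fetched in one call, for example with `es.cluster.health(level="indices")`, whose response contains an `indices` mapping from index name to per-index health. The function name `image_index_is_healthy` is hypothetical, and the sketch assumes the index appears under the literal key `image`, which may not hold if the deployment uses aliases or suffixed index names.

```python
def image_index_is_healthy(cluster_health, index="image", required_status="green"):
    """Return True only if `index` exists in the response and has the required status.

    `cluster_health` is the dict-like response of a cluster health query made
    with level="indices".
    """
    indices = cluster_health.get("indices", {})
    entry = indices.get(index)
    if entry is None:
        # A missing index should fail the healthcheck, per the comment above.
        return False
    return entry.get("status") == required_status
```

Because the query is not scoped to an index that may be missing, it avoids the hanging request described earlier in the thread.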

madewithkode commented on Tue May 12 2020:

Hi again Onyenanu – if the index doesn't exist, the healthcheck should fail. This could happen in situations where we are switching Elasticsearch clusters in production and forgot to index data into the new one (or something went wrong while we were loading data into the new cluster).

In my experience, the ES Python libs can behave in unexpected ways that you sometimes have to work around. Since it seems like querying specifically for the image index health hangs when the index doesn't exist, perhaps you could query for healthchecks of every index in the cluster, and fail the healthcheck if image is not among them and green?

It sounds like it's coming along nicely!

Hey Alden, many thanks again for the helpful insights. The suggestion sounds good; I'll proceed with it.

And yes, the whole thing is getting more interesting; I've learnt a lot in the past few days :)
source

sarayourfriend · 2022-12-15T06:14:19Z

@WordPress/openverse-api and @WordPress/openverse-infrastructure I've been thinking about this issue for the last 30 minutes. I have a basic implementation that should work fine, but the more I think about this and our infrastructure implementation, the more I would like to discuss the intention here further.

The primary difficulty I'm having in understanding exactly how we should expect this to work is that we use the healthcheck endpoint to notify the ASG/ECS Service to restart. If we always check for ES health on the /healthcheck/ endpoint, then it could happen that we end up with the API service restarting only because the ES cluster is unhealthy. In this case, the API service wouldn't be fixed by restarting it because the issue is with ES and restarting the API doesn't affect ES's condition.

However, because our ES cluster is not publicly accessible, we'd not be able to make public requests using something like uptime robot to check for ES health. I think it could be appropriate to proxy an ES healthcheck through the API healthcheck endpoint via an optional query param like check_es that is False by default. That way we can observe ES health specifically while keeping the API serving normal error responses without being infinitely restarted by ASG/ECS if ES is unhealthy.

Does this make sense to folks? I'll put up a draft PR with the code to do as I described. If anyone has other ideas for how this could be approached so that we have good visibility into ES health but don't cause unnecessary Django service restarts, please share them.

Update: the promised draft PR is here: #1047
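
A framework-agnostic sketch of the opt-in check described above; it is not the code from the linked PR. The function name `healthcheck`, the injected `es_is_healthy` callable, the accepted truthy spellings of `check_es`, and the 500 status code (taken from the original issue text) are all assumptions made for illustration.

```python
TRUTHY = {"1", "true", "yes"}  # assumption: accepted spellings for check_es


def healthcheck(query_params, es_is_healthy):
    """Return (status_code, body) for a healthcheck request.

    `query_params` is a dict-like object of query string values.
    `es_is_healthy` is a hypothetical callable that returns True when
    Elasticsearch is healthy; it is only called when check_es is set.
    """
    check_es = str(query_params.get("check_es", "")).lower() in TRUTHY
    if not check_es:
        # Default path used by the load balancer: never depends on Elasticsearch,
        # so an unhealthy cluster cannot trigger API restarts.
        return 200, "OK"
    if es_is_healthy():
        return 200, "OK"
    return 500, "Elasticsearch is unhealthy"
```

With this shape, the ASG/ECS health probe keeps calling the endpoint without parameters, while an external monitor can request `?check_es=true` to observe Elasticsearch health through the API.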

AetherUnbound · 2022-12-15T19:48:58Z

Those are great points! Agreed that we would definitely not want to be erroneously restarting the API if Elasticsearch itself is failing.

I think your suggested approach is solid and I'll comment on the draft PR you've opened.

zackkrida · 2022-12-15T19:56:38Z

I like the query param idea; great suggestion.

sarayourfriend self-assigned this Dec 15, 2022

sarayourfriend added the 🟨 priority: medium, ✨ goal: improvement, and 💻 aspect: code labels Dec 15, 2022

sarayourfriend mentioned this issue Dec 15, 2022

Add ES healthchecks to /healthcheck/ endpoint #1047

Merged

sarayourfriend closed this as completed in #1047 Jan 4, 2023

sarayourfriend mentioned this issue May 3, 2023

Add additional checks to ingestion server healthcheck endpoint WordPress/openverse#2019

Closed
