
Add index readiness check to data refresh #2209

Merged: 19 commits merged into main from add/api-healthcheck-to-data-refresh on Jun 20, 2023

Conversation

@stacimc (Collaborator) commented on May 27, 2023:

Fixes

Fixes #2071 by @dhruvkb

Description

Adds an index_readiness_check task to the data refresh DAGs (including the create_filtered_<media>_index DAGs), used to detect issues where the index reports as ready but searches against it are not actually returning results. The task is an HttpSensor which runs after ingest_upstream completes (creating the new ES index) and queries the new ES index directly, passing only when we retrieve a threshold number of hits. Once it passes, the DAG is able to move on to the promote steps and actually promote the indices in production.

[Screenshot: Screen Shot 2023-06-15 at 4 31 32 PM]

Only the data refresh TaskGroup is shown.

[Screenshot: Screen Shot 2023-06-20 at 3 01 07 PM]

The sensor times out after 24 hours, so the entire DAG will fail (without promoting the new index) and notify in Slack if the ES index does not become ready within that timeframe. After manual intervention, the DAG can be safely resumed from this point.
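For orientation, here is a minimal sketch of roughly what such a sensor could look like. The connection id and the ES_INDEX_READINESS_RECORD_COUNT variable come from the testing instructions below; the index name, default threshold, and poke interval are illustrative placeholders rather than the exact values in this PR.

import os
from datetime import timedelta

from airflow.providers.http.sensors.http import HttpSensor

# Threshold read from the environment, as in the testing instructions below;
# the default here is only a placeholder.
THRESHOLD = int(os.getenv("ES_INDEX_READINESS_RECORD_COUNT", "10000"))


def _index_has_enough_hits(response) -> bool:
    # Pass only once the new index returns at least the threshold number of hits.
    total_hits = response.json().get("hits", {}).get("total", {}).get("value", 0)
    return total_hits >= THRESHOLD


index_readiness_check = HttpSensor(
    task_id="index_readiness_check",
    http_conn_id="elasticsearch_http_production",  # from AIRFLOW_CONN_ELASTICSEARCH_HTTP_PRODUCTION
    endpoint="image-abc123/_search",  # hypothetical name of the newly created index
    method="GET",
    response_check=_index_has_enough_hits,
    mode="reschedule",
    poke_interval=60 * 10,  # illustrative; the real interval is configured per DAG
    timeout=timedelta(days=1).total_seconds(),  # give up (and fail the DAG) after 24 hours
)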

Testing Instructions

Setup

  • Add ES_INDEX_READINESS_RECORD_COUNT=1000 to your catalog/.env. This is used in local testing to lower the threshold record count, since local test data does not have as many records.

  • Set up the ES connection by adding the following to catalog/.env: AIRFLOW_CONN_ELASTICSEARCH_HTTP_PRODUCTION=http://es:9200

  • Run just down && just up

  • Enable the create_filtered_audio_index and create_filtered_image_index DAGs

Tests

  1. For each media type, run the associated data_refresh DAG and ensure that the whole DAG passes.
  2. Test "unhealthy" results: the easiest way to do this is to manually edit the ES_INDEX_READINESS_RECORD_COUNT record to a larger number (100_000). Then run the DAG and observe that the index_readiness_check continuously goes up_for_reschedule and does not pass.
  3. Test 404s for invalid index: the easiest way to simulate this is to hardcode the index_readiness_check task to query with an incorrect index name. Manually edit the task here:
...
endpoint=f"{media_type}-foo/_search",  # This index does not exist
method="GET",
...

Re-run the DAG and observe that the task continuously goes up_for_reschedule and does not pass.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc added the 🟧 priority: high, 🌟 goal: addition, 💻 aspect: code, and 🧱 stack: catalog labels on May 27, 2023
@stacimc stacimc requested review from a team as code owners May 27, 2023 00:29
@stacimc stacimc self-assigned this May 27, 2023
@stacimc stacimc requested review from obulat and dhruvkb May 27, 2023 00:29
@github-actions github-actions bot added the 🧱 stack: api Related to the Django API label May 27, 2023
@AetherUnbound (Collaborator) commented:

I'm testing this out locally right now, but wanted to surface something - the OAuth tokens we receive from the API aren't indefinite, right? I think they have an expiration time (my local one was 43200s), which means that the token will only be valid for a short time. We may need to dust off the OAuth DAGs in order to get this working without intervention 😮

@AetherUnbound (Collaborator) left a review comment:

This LGTM! I was able to run all the testing instructions locally successfully. I have a few comments but nothing blocking. I do, however, think that the token piece will require a follow-up PR for managing the OAuth registration.

If an access token is invalid or expired, querying the media endpoints using internal__index will not throw an error: it will silently query against the live table. This could obviously be misleading in the health check.

This seems like behavior we'd want to change on the API; returning a 400 when a token is invalid for that particular parameter would probably be best, I think, especially given that the token we have will eventually expire.
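For illustration only, the suggested 400 could be raised at validation time. The serializer and field handling below are hypothetical, not the actual Openverse API code:

from rest_framework import serializers


class MediaSearchRequestSerializer(serializers.Serializer):
    # Hypothetical sketch: reject the parameter outright instead of silently
    # falling back to the live index when the caller is not properly authorized.
    internal__index = serializers.CharField(required=False)

    def validate_internal__index(self, value):
        request = self.context.get("request")
        if request is None or not getattr(request, "auth", None):
            # Raising here surfaces as an HTTP 400 response in DRF.
            raise serializers.ValidationError(
                "internal__index requires a valid access token."
            )
        return value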

catalog/dags/common/constants.py (outdated review thread, resolved)
catalog/dags/common/ingestion_server.py (outdated review thread, resolved)
poke_interval=poke_interval,
timeout=timeout.total_seconds(),
retries=retries,
retry_exponential_backoff=True,

TIL, very neat!

catalog/tests/dags/common/test_ingestion_server.py (outdated review thread, resolved)
@sarayourfriend (Collaborator) left a review comment:

I may be wrong about this, but I think the current implementation incorrectly uses the API token/misses an important implementation detail.

To make authenticated requests to the API, a new token must be retrieved using the API credentials. These tokens expire after a set period of time. See the response body documented in https://api.openverse.engineering/v1/#tag/auth/operation/token which includes expires_in. The token must be refreshed by making a new API token request each time. Even our own frontend does this token refresh dance:

const refreshApiAccessToken = async (
  clientId: string,
  clientSecret: string
) => {
  const formData = new URLSearchParams()
  formData.append("client_id", clientId)
  formData.append("client_secret", clientSecret)
  formData.append("grant_type", "client_credentials")
  const apiService = createApiService()
  try {
    const res = await apiService.post<TokenResponse>(
      "auth_tokens/token",
      formData
    )
    process.tokenData.accessToken = res.data.access_token
    process.tokenData.accessTokenExpiry = currTimestamp() + res.data.expires_in
  } catch (e) {
    /**
     * If an error occurs, serve the current request (and any pending)
     * anonymously and hope it works. By setting the expiry to 0 we queue
     * up another token fetch attempt for the next request.
     */
    error("Unable to retrieve API token, clearing existing token", e)
    process.tokenData.accessToken = ""
    process.tokenData.accessTokenExpiry = 0
    ;(e as AxiosError).message = `Unable to retrieve API token. ${
      (e as AxiosError).message
    }`
    throw e
  }
}

Based on what I've seen of how the AwsHook works from the AWS provider, I think a new Hook is the right way to manage that kind of thing. The hook would need the client ID and secret in order to make the token refresh request. Ideally the newest token would be stored somewhere in the database (or shared memory? Redis?) and reused if possible or refreshed before a request is sent.
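A very rough sketch of the kind of hook being described, assuming the token endpoint and payload shown in the snippet above; the class name, connection id, and caching strategy are placeholders, not a real implementation:

import time

import requests
from airflow.hooks.base import BaseHook


class OpenverseApiHook(BaseHook):
    """Hypothetical hook that fetches and caches an API token for authenticated requests."""

    def __init__(self, openverse_conn_id="openverse_api"):
        super().__init__()
        self.openverse_conn_id = openverse_conn_id
        self._token = None
        self._expiry = 0.0

    def get_token(self):
        # Refresh the token only when the cached one is missing or about to expire.
        if self._token is None or time.time() >= self._expiry - 60:
            conn = self.get_connection(self.openverse_conn_id)
            response = requests.post(
                f"{conn.host}/v1/auth_tokens/token/",
                data={
                    "client_id": conn.login,
                    "client_secret": conn.password,
                    "grant_type": "client_credentials",
                },
                timeout=10,
            )
            response.raise_for_status()
            payload = response.json()
            self._token = payload["access_token"]
            self._expiry = time.time() + payload["expires_in"]
        return self._token

As noted above, sharing the token across tasks would still need a persistent store (the metadata database, an Airflow Variable, or Redis) rather than this in-memory cache.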

@sarayourfriend (Collaborator) left a review comment:

Two questions:

  • Is it possible to derive index "readiness" from the Elasticsearch API (see the sketch after this list)? If so, it could be a more reliable and technically "correct" indicator than a rather generic threshold of expected responses. This approach works, but it also required complicating the API by adding the internal__index param, which, if not needed, would be nice to remove to reduce the API complexity. Though, I don't know if that parameter is only needed for this, or if this is just one of several use cases for it.
  • At a glance, I think these changes don't make Delay origin index alias promotion until after filtered index is created #2314 more complicated. Can you double-check to make sure that is the case? These changes, if anything, probably make it slightly easier, but I could be wrong about that and just want to have it in mind so as not to create any inadvertent snags for the other issue down the line.
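As a hedged illustration of what asking Elasticsearch directly could look like (the index name is a placeholder, and whether cluster/index health is a sufficient readiness signal is exactly the open question here):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://es:9200")  # connection URL from the setup instructions

# Block (up to the timeout) until the hypothetical new index reaches at least yellow status.
health = es.cluster.health(index="image-new", wait_for_status="yellow", timeout="30s")
print(health["status"])  # "green", "yellow", or "red"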

@openverse-bot (Collaborator) commented:

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@obulat
@dhruvkb
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend days[1], this PR was ready for review 9 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)[2].

@stacimc, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

@stacimc stacimc marked this pull request as draft June 9, 2023 00:02
@stacimc (Collaborator, Author) commented on Jun 9, 2023:

Drafting to address concerns.

I noted the issue with the token refresh and was hoping there was some kind of lever already available to us for bypassing that internally, or else that the automation could be moved to a follow-up :/ We can definitely generate/refresh the tokens in Airflow -- the existing OAuth2 DAGs are a great place to start. Noting though that this is blocking the image data refresh in the meantime.

Is it possible to derive index "readiness" from the Elasticsearch API? If so, it could be a more reliable and technically "correct" indicator than a rather generic threshold of responses to expect. This approach works, but it also required complicating the API by adding the internal__index param, which, if not needed, would be nice to remove and reduce the API complexity. Though, I don't know if that parameter is only needed for this, or if this is just one of several use cases for it.

Really good question @sarayourfriend -- I'm not sure! I had not considered alternatives to the proposed implementation. I believe that is the only intended use of internal__index -- as noted though, I think we'll have to update it to not silently pass through to the default index if the token is invalid, so that's a further complication.

We could query the Elasticsearch index directly instead of going through the API -- this would eliminate the need for the internal__index param and all of the auth concerns. The catalog does not currently have a connection to the ES cluster but will as part of #2133.

Also, doing a bit of research, I'm curious whether what we need in the first place is to add the refresh parameter to our bulk insert, possibly set to wait_for. The documentation is a little unclear about where it's supported -- it's mentioned for bulk but not parallel_bulk, but I believe it should just pass through. I will continue looking at this.
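For reference, a minimal sketch of what passing refresh through the bulk helper might look like; the index name and documents are placeholders, and whether parallel_bulk forwards the kwarg the same way is the open question above:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://es:9200")
documents = [{"title": "example document"}]  # placeholder documents

# Hypothetical index name; the kwarg is forwarded to the underlying bulk API call.
actions = ({"_index": "image-new", "_source": doc} for doc in documents)

# refresh="wait_for" asks Elasticsearch not to return until the inserted
# documents are visible to search.
bulk(es, actions, refresh="wait_for")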

@sarayourfriend (Collaborator) commented on Jun 9, 2023:

The workers should ping /worker_finished when they're done, which triggers the refresh:

https://github.com/WordPress/openverse/blob/4d6e995d2268dcd71381bce61b51b8eb558d7adf/ingestion_server/ingestion_server/api.py#L264-L313

We shouldn't refresh during the bulk insert because the cost of each refresh can be offset until all documents are inserted and indexed (i.e., when /worker_finished is called by the last worker).

It could be that somehow the task_data.percent_successful == 100 check didn't pass when the last worker called the endpoint; I'm just not sure exactly how that would happen or how we could debug whether that was the case.

Querying ES directly sounds like an excellent alternative to avoid all the API auth issues though 👍 There is no workaround for this on the API side, and I don't think it would be correct to implement tokens with infinite lifetimes. The potential for accidental misuse is too great.

@zackkrida (Member) commented:

A small question here: is health_check appropriate or does it cause too much confusion with our existing service-level healthchecks for the API, frontend, etc? Would "index readiness check" or something potentially be more appropriate?

@stacimc stacimc force-pushed the add/api-healthcheck-to-data-refresh branch from 502a387 to 4725cb2 Compare June 15, 2023 23:27
@stacimc stacimc changed the title Add API healthcheck to data refresh Add index readiness check to data refresh Jun 15, 2023
@stacimc stacimc marked this pull request as ready for review June 16, 2023 00:03
@stacimc stacimc requested a review from AetherUnbound June 16, 2023 00:03
@sarayourfriend (Collaborator) left a review comment:

The API-related environment variables can be removed now, right?

catalog/dags/common/ingestion_server.py (outdated review thread, resolved)
catalog/dags/common/ingestion_server.py (outdated review thread, resolved)
catalog/tests/dags/common/test_ingestion_server.py (outdated review thread, resolved)
@sarayourfriend (Collaborator) left a review comment:

I've tested this locally and it works great, including the failure states.

One question: are we confident that index "readiness" can be sufficiently deduced from cluster health? I can't remember from the original incident whether the cluster was reporting green health even while it wasn't returning any search results.

I was also curious if the cluster would go yellow at the expected time, but to test that I had to have a green cluster locally. I adapted this docker-compose ES configuration to get a three-node cluster running locally with the diff below (collapsed to save space):

Multi-node local ES cluster with green health
diff --git a/docker-compose.yml b/docker-compose.yml
index 3304ac042..36d0b0bc1 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -177,30 +177,65 @@ services:
     ports:
       - "50263:6379"
 
-  es:
-    profiles:
-      - ingestion_server
-      - api
+  es01:
+    networks:
+      default:
+        aliases:
+          - es
     image: docker.elastic.co/elasticsearch/elasticsearch:7.12.0
+    container_name: es01
+    environment:
+      - node.name=es01
+      - cluster.name=openverse-local
+      - discovery.seed_hosts=es02,es03
+      - cluster.initial_master_nodes=es01,es02,es03
+      - bootstrap.memory_lock=true
+      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
+    depends_on:
+      - es02
+      - es03
+    ulimits:
+      memlock:
+        soft: -1
+        hard: -1
+    volumes:
+      - es-data01:/usr/share/elasticsearch/data
     ports:
       - "50292:9200"
-    env_file:
-      - docker/es/env.docker
-    healthcheck:
-      test:
-        [
-          "CMD-SHELL",
-          "curl -si -XGET 'localhost:9200/_cluster/health?pretty' | grep -qE 'yellow|green'",
-        ]
-      interval: 10s
-      timeout: 60s
-      retries: 10
+    
+  es02:
+    image: docker.elastic.co/elasticsearch/elasticsearch:7.12.0
+    container_name: es02
+    environment:
+      - node.name=es02
+      - cluster.name=openverse-local
+      - discovery.seed_hosts=es01,es03
+      - cluster.initial_master_nodes=es01,es02,es03
+      - bootstrap.memory_lock=true
+      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
     ulimits:
-      nofile:
-        soft: 65536
-        hard: 65536
+      memlock:
+        soft: -1
+        hard: -1
+    volumes:
+      - es-data02:/usr/share/elasticsearch/data
+    
+  es03:
+    image: docker.elastic.co/elasticsearch/elasticsearch:7.12.0
+    container_name: es03
+    environment:
+      - node.name=es03
+      - cluster.name=openverse-local
+      - discovery.seed_hosts=es01,es02
+      - cluster.initial_master_nodes=es01,es02,es03
+      - bootstrap.memory_lock=true
+      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
+    ulimits:
+      memlock:
+        soft: -1
+        hard: -1
     volumes:
-      - es-data:/usr/share/elasticsearch/data
+      - es-data03:/usr/share/elasticsearch/data
 
   web:
     profiles:
@@ -219,7 +254,7 @@ services:
       - "50230:3000" # Sphinx (unused by default; see `sphinx-live` recipe)
     depends_on:
       - db
-      - es
+      - es01
       - cache
     env_file:
       - api/env.docker
@@ -246,7 +281,7 @@ services:
     depends_on:
       - db
       - upstream_db
-      - es
+      - es01
       - indexer_worker
     volumes:
       - ./ingestion_server:/ingestion_server
@@ -272,7 +307,7 @@ services:
     depends_on:
       - db
       - upstream_db
-      - es
+      - es01
     volumes:
       - ./ingestion_server:/ingestion_server
     env_file:
@@ -300,6 +335,8 @@ volumes:
   catalog-postgres:
   plausible-postgres:
   plausible-clickhouse:
-  es-data:
+  es-data01:
+  es-data02:
+  es-data03:
   minio:
   catalog-cache:

When I run the reindex, it does go yellow at the time we expect here.

So my only question is whether it's still worth querying the new index directly to see if it reports hits matching/exceeding the threshold.
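For what it's worth, the direct query being discussed could be as small as this sketch (the client URL comes from the setup instructions; the index name is a placeholder):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://es:9200")

# Ask the hypothetical new index directly how many documents it can serve to search,
# and compare that against the expected threshold.
result = es.count(index="image-new")
print(result["count"])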

Also, as a follow-up to this change, are we able to remove exact_index after this or do we need to keep it for anything else?

I'll approve once the questions are cleared up and unneeded environment variables etc. are cleaned up (if we don't need them anymore).

@stacimc (Collaborator, Author) commented on Jun 16, 2023:

One question: are we confident that index "readiness" can be sufficiently deduced from cluster health? I can't remember from the original incident whether the cluster was reporting green health even while it wasn't returning any search results.
...
When I run the reindex, it does go yellow at the time we expect here.

I was under this impression from the incident report, and this seems like it should be the case from the Elasticsearch documentation. However, I went back through Slack just now to triple-check and saw a comment suggesting the cluster was green when it was returning no documents 😱

I've updated this to check for hits from the index instead. I don't think it's necessary to keep the cluster health check as well, if it indeed cannot be relied upon.

Also, as a follow-up to this change, are we able to remove exact_index after this or do we need to keep it for anything else?

I believe we can and I intend to open that as a follow-up PR, yes.

@AetherUnbound (Collaborator) commented on Jun 16, 2023:

It seems this needs the elasticsearch_production connection; do you mind adding the steps for defining that in the testing instructions? Additionally, we're going to have an Elasticsearch-type (rather than HTTP-type) connection added as part of #2371 -- should we distinguish this connection type in some way? Possibly elasticsearch_http_production?

@dhruvkb dhruvkb removed their request for review June 19, 2023 04:07
@stacimc (Collaborator, Author) commented on Jun 19, 2023:

@obulat Can you check that you recreated the catalog/.env file (rm catalog/.env && cp catalog/env.template catalog/.env && just down && just up) and created the data refresh pool? I always get weird behaviour when I forget either of those things, even if I try to rerun.

I opened #2434 to prevent the data refresh pool from going missing locally.

@sarayourfriend FWIW #2352 made it so that you don't need to set up the data_refresh pool in local development, which I rebased this branch specifically to pick up because it's so annoying :)

@obulat Some of the env variable names changed during development -- the instructions are up to date but maybe you had an outdated connection id? If recreating the catalog/.env file and just down && just up doesn't work, I'd also recommend running just init to ensure you have some data in the catalog & api dbs.

@obulat (Contributor) left a review comment:

@obulat Some of the env variable names changed during development -- the instructions are up to date but maybe you had an outdated connection id? If recreating the catalog/.env file and just down && just up doesn't work, I'd also recommend running just init to ensure you have some data in the catalog & api dbs.

I checked and re-checked the env variables and the data_refresh pool, but apparently missed something. Ran the commands @sarayourfriend listed to recreate the .env file, and everything ran very well! Excited to start running refreshes again 🚀

@sarayourfriend (Collaborator) commented:

@sarayourfriend FWIW #2352 made it so that you don't need to set up the data_refresh pool in local development, which I rebased this branch specifically to pick up because it's so annoying :)

!! Thanks for letting me know, I guess that makes the other issue moot 🙂

@sarayourfriend (Collaborator) commented:

@stacimc I just realised that we probably need the same check in filtered index creation for newly created filtered indexes, right?

@stacimc stacimc force-pushed the add/api-healthcheck-to-data-refresh branch from b67ff80 to 8ee2719 Compare June 20, 2023 22:03
@stacimc (Collaborator, Author) commented on Jun 20, 2023:

@stacimc I just realised that we probably need the same check in filtered index creation for newly created filtered indexes, right?

Very good call! Done.

@sarayourfriend (Collaborator) left a review comment:

Nice!

@stacimc stacimc merged commit 475f464 into main Jun 20, 2023
@stacimc stacimc deleted the add/api-healthcheck-to-data-refresh branch June 20, 2023 22:32
Labels: 💻 aspect: code, 🌟 goal: addition, 🟧 priority: high, 🧱 stack: api, 🧱 stack: catalog
Projects: Archived in project
Development

Successfully merging this pull request may close these issues.

Check API queries against the new index before promotion
6 participants