
Add index readiness check to data refresh #2209

Merged: 19 commits merged into main from add/api-healthcheck-to-data-refresh on Jun 20, 2023

Conversation

@stacimc (Collaborator) commented on May 27, 2023:

Fixes

Fixes #2071 by @dhruvkb

Description

Adds an index_readiness_check task to the data refresh DAGs (including the create_filtered_<media>_index DAGs), used to detect issues where the index reports as ready but searches against it are not actually returning results. The task is an HttpSensor which runs after ingest_upstream completes (creating the new ES index) and queries the new ES index directly, passing only when we retrieve a threshold number of hits. Once it passes, the DAG is able to move on to the promote steps and actually promote the indices in production.

[Screenshot: Screen Shot 2023-06-15 at 4 31 32 PM]

Only the data refresh TaskGroup is shown.

[Screenshot: Screen Shot 2023-06-20 at 3 01 07 PM]

The sensor times out after 24 hours, so the entire DAG will fail (without promoting the new index) and notify in Slack if the ES index does not become ready within that timeframe. After manual intervention, the DAG can be safely resumed from this point.
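For orientation, here is a minimal sketch of roughly what such a sensor could look like. The connection id and the ES_INDEX_READINESS_RECORD_COUNT variable come from the testing instructions below; the index name, default threshold, and poke interval are illustrative placeholders rather than the exact values in this PR.

import os
from datetime import timedelta

from airflow.providers.http.sensors.http import HttpSensor

# Threshold read from the environment, as in the testing instructions below;
# the default here is only a placeholder.
THRESHOLD = int(os.getenv("ES_INDEX_READINESS_RECORD_COUNT", "10000"))


def _index_has_enough_hits(response) -> bool:
    # Pass only once the new index returns at least the threshold number of hits.
    total_hits = response.json().get("hits", {}).get("total", {}).get("value", 0)
    return total_hits >= THRESHOLD


index_readiness_check = HttpSensor(
    task_id="index_readiness_check",
    http_conn_id="elasticsearch_http_production",  # from AIRFLOW_CONN_ELASTICSEARCH_HTTP_PRODUCTION
    endpoint="image-abc123/_search",  # hypothetical name of the newly created index
    method="GET",
    response_check=_index_has_enough_hits,
    mode="reschedule",
    poke_interval=60 * 10,  # illustrative; the real interval is configured per DAG
    timeout=timedelta(days=1).total_seconds(),  # give up (and fail the DAG) after 24 hours
)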

Testing Instructions

Setup

  • Add ES_INDEX_READINESS_RECORD_COUNT=1000 to your catalog/.env. This is used in local testing to lower the threshold record count, since local test data does not have as many records.

  • Set up the ES connection by adding the following to catalog/.env: AIRFLOW_CONN_ELASTICSEARCH_HTTP_PRODUCTION=http://es:9200

  • Run just down && just up

  • Enable the create_filtered_audio_index and create_filtered_image_index DAGs

Tests

  1. For each media type, run the associated data_refresh DAG and ensure that the whole DAG passes.
  2. Test "unhealthy" results: the easiest way to do this is to manually edit the ES_INDEX_READINESS_RECORD_COUNT record to a larger number (100_000). Then run the DAG and observe that the index_readiness_check continuously goes up_for_reschedule and does not pass.
  3. Test 404s for invalid index: the easiest way to simulate this is to hardcode the index_readiness_check task to query with an incorrect index name. Manually edit the task here:
...
endpoint=f"{media_type}-foo/_search",  # This index does not exist
method="GET",
...

Re-run the DAG and observe that the task continuously goes up_for_reschedule and does not pass.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc added the 🟧 priority: high, 🌟 goal: addition, 💻 aspect: code, and 🧱 stack: catalog labels on May 27, 2023
@stacimc stacimc requested review from a team as code owners May 27, 2023 00:29
@stacimc stacimc self-assigned this May 27, 2023
@stacimc stacimc requested review from obulat and dhruvkb May 27, 2023 00:29
@github-actions github-actions bot added the 🧱 stack: api Related to the Django API label May 27, 2023
@AetherUnbound (Collaborator) commented:

I'm testing this out locally right now, but wanted to surface something - the OAuth tokens we receive from the API aren't indefinite, right? I think they have an expiration time (my local one was 43200s), which means that the token will only be valid for a short time. We may need to dust off the OAuth DAGs in order to get this working without intervention 😮

@AetherUnbound (Collaborator) left a review comment:

This LGTM! I was able to run all the testing instructions locally successfully. I have a few comments but nothing blocking. I do, however, think that the token piece will require a follow-up PR for managing the OAuth registration.

If an access token is invalid or expired, querying the media endpoints using internal__index will not throw an error: it will silently query against the live table. This could obviously be misleading in the health check.

This seems like behavior we'd want to change on the API; returning a 400 when a token is invalid for that particular parameter would probably be best, I think, especially given that the token we have will eventually expire.
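For illustration only, the suggested 400 could be raised at validation time. The serializer and field handling below are hypothetical, not the actual Openverse API code:

from rest_framework import serializers


class MediaSearchRequestSerializer(serializers.Serializer):
    # Hypothetical sketch: reject the parameter outright instead of silently
    # falling back to the live index when the caller is not properly authorized.
    internal__index = serializers.CharField(required=False)

    def validate_internal__index(self, value):
        request = self.context.get("request")
        if request is None or not getattr(request, "auth", None):
            # Raising here surfaces as an HTTP 400 response in DRF.
            raise serializers.ValidationError(
                "internal__index requires a valid access token."
            )
        return value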

catalog/dags/common/constants.py (outdated review thread, resolved)
catalog/dags/common/ingestion_server.py (outdated review thread, resolved)
poke_interval=poke_interval,
timeout=timeout.total_seconds(),
retries=retries,
retry_exponential_backoff=True,

TIL, very neat!

catalog/tests/dags/common/test_ingestion_server.py (outdated review thread, resolved)
@sarayourfriend (Collaborator) left a review comment:

I may be wrong about this, but I think the current implementation incorrectly uses the API token/misses an important implementation detail.

To make authenticated requests to the API, a new token must be retrieved using the API credentials. These tokens expire after a set period of time. See the response body documented in https://api.openverse.engineering/v1/#tag/auth/operation/token which includes expires_in. The token must be refreshed by making a new API token request each time. Even our own frontend does this token refresh dance:

const refreshApiAccessToken = async (
  clientId: string,
  clientSecret: string
) => {
  const formData = new URLSearchParams()
  formData.append("client_id", clientId)
  formData.append("client_secret", clientSecret)
  formData.append("grant_type", "client_credentials")
  const apiService = createApiService()
  try {
    const res = await apiService.post<TokenResponse>(
      "auth_tokens/token",
      formData
    )
    process.tokenData.accessToken = res.data.access_token
    process.tokenData.accessTokenExpiry = currTimestamp() + res.data.expires_in
  } catch (e) {
    /**
     * If an error occurs, serve the current request (and any pending)
     * anonymously and hope it works. By setting the expiry to 0 we queue
     * up another token fetch attempt for the next request.
     */
    error("Unable to retrieve API token, clearing existing token", e)
    process.tokenData.accessToken = ""
    process.tokenData.accessTokenExpiry = 0
    ;(e as AxiosError).message = `Unable to retrieve API token. ${
      (e as AxiosError).message
    }`
    throw e
  }
}

Based on what I've seen of how the AwsHook works from the AWS provider, I think a new Hook is the right way to manage that kind of thing. The hook would need the client ID and secret in order to make the token refresh request. Ideally the newest token would be stored somewhere in the database (or shared memory? Redis?) and reused if possible or refreshed before a request is sent.
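A very rough sketch of the kind of hook being described, assuming the token endpoint and payload shown in the snippet above; the class name, connection id, and caching strategy are placeholders, not a real implementation:

import time

import requests
from airflow.hooks.base import BaseHook


class OpenverseApiHook(BaseHook):
    """Hypothetical hook that fetches and caches an API token for authenticated requests."""

    def __init__(self, openverse_conn_id="openverse_api"):
        super().__init__()
        self.openverse_conn_id = openverse_conn_id
        self._token = None
        self._expiry = 0.0

    def get_token(self):
        # Refresh the token only when the cached one is missing or about to expire.
        if self._token is None or time.time() >= self._expiry - 60:
            conn = self.get_connection(self.openverse_conn_id)
            response = requests.post(
                f"{conn.host}/v1/auth_tokens/token/",
                data={
                    "client_id": conn.login,
                    "client_secret": conn.password,
                    "grant_type": "client_credentials",
                },
                timeout=10,
            )
            response.raise_for_status()
            payload = response.json()
            self._token = payload["access_token"]
            self._expiry = time.time() + payload["expires_in"]
        return self._token

As noted above, sharing the token across tasks would still need a persistent store (the metadata database, an Airflow Variable, or Redis) rather than this in-memory cache.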

@sarayourfriend (Collaborator) left a review comment:

Two questions:

  • Is it possible to derive index "readiness" from the Elasticsearch API (see the sketch after this list)? If so, it could be a more reliable and technically "correct" indicator than a rather generic threshold of expected responses. This approach works, but it also required complicating the API by adding the internal__index param, which, if not needed, would be nice to remove to reduce the API complexity. Though, I don't know if that parameter is only needed for this, or if this is just one of several use cases for it.
  • At a glance, I think these changes don't make Delay origin index alias promotion until after filtered index is created #2314 more complicated. Can you double-check to make sure that is the case? These changes, if anything, probably make it slightly easier, but I could be wrong about that and just want to have it in mind so as not to create any inadvertent snags for the other issue down the line.
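As a hedged illustration of what asking Elasticsearch directly could look like (the index name is a placeholder, and whether cluster/index health is a sufficient readiness signal is exactly the open question here):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://es:9200")  # connection URL from the setup instructions

# Block (up to the timeout) until the hypothetical new index reaches at least yellow status.
health = es.cluster.health(index="image-new", wait_for_status="yellow", timeout="30s")
print(health["status"])  # "green", "yellow", or "red"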

@openverse-bot (Collaborator) commented:

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@obulat
@dhruvkb
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend days[1], this PR was ready for review 9 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)[2].

@stacimc, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

@stacimc stacimc marked this pull request as draft June 9, 2023 00:02
@stacimc (Collaborator, Author) commented on Jun 9, 2023:

Drafting to address concerns.

I noted the issue with the token refresh and was hoping there was some kind of lever already available to us for bypassing that internally, or else that the automation could be moved to a follow-up :/ We can definitely generate/refresh the tokens in Airflow -- the existing OAuth2 DAGs are a great place to start. Noting though that this is blocking the image data refresh in the meantime.

Is it possible to derive index "readiness" from the Elasticsearch API? If so, it could be a more reliable and technically "correct" indicator than a rather generic threshold of responses to expect. This approach works, but it also required complicating the API by adding the internal__index param, which, if not needed, would be nice to remove and reduce the API complexity. Though, I don't know if that parameter is only needed for this, or if this is just one of several use cases for it.

Really good question @sarayourfriend -- I'm not sure! I had not considered alternatives to the proposed implementation. I believe that is the only intended use of internal__index -- as noted though, I think we'll have to update it to not silently pass through to the default index if the token is invalid, so that's a further complication.

We could query the Elasticsearch index directly instead of going through the API -- this would eliminate the need for the internal__index param and all of the auth concerns. The catalog does not currently have a connection to the ES cluster but will as part of #2133.

Also, doing a bit of research, I'm curious whether what we need in the first place is to add the refresh parameter to our bulk insert, possibly set to wait_for. The documentation is a little unclear about where it's supported -- it's mentioned for bulk but not parallel_bulk, but I believe it should just pass through. I will continue looking at this.
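For reference, a minimal sketch of what passing refresh through the bulk helper might look like; the index name and documents are placeholders, and whether parallel_bulk forwards the kwarg the same way is the open question above:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://es:9200")
documents = [{"title": "example document"}]  # placeholder documents

# Hypothetical index name; the kwarg is forwarded to the underlying bulk API call.
actions = ({"_index": "image-new", "_source": doc} for doc in documents)

# refresh="wait_for" asks Elasticsearch not to return until the inserted
# documents are visible to search.
bulk(es, actions, refresh="wait_for")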

@sarayourfriend (Collaborator) commented on Jun 9, 2023:

The workers should ping /worker_finished when they're done, which triggers the refresh:

https://github.com/WordPress/openverse/blob/4d6e995d2268dcd71381bce61b51b8eb558d7adf/ingestion_server/ingestion_server/api.py#L264-L313

We shouldn't refresh during the bulk insert because the cost of each refresh can be offset until all documents are inserted and indexed (i.e., when /worker_finished is called by the last worker).

It could be that somehow the task_data.percent_successful == 100 check didn't pass when the last worker called the endpoint; I'm just not sure exactly how that would happen or how we could debug whether that was the case.

Querying ES directly sounds like an excellent alternative to avoid all the API auth issues though 👍 There is no workaround for this on the API side, and I don't think it would be correct to implement tokens with infinite lifetimes. The potential for accidental misuse is too great.

@zackkrida (Member) commented:

A small question here: is health_check appropriate or does it cause too much confusion with our existing service-level healthchecks for the API, frontend, etc? Would "index readiness check" or something potentially be more appropriate?

@stacimc stacimc force-pushed the add/api-healthcheck-to-data-refresh branch from 502a387 to 4725cb2 Compare June 15, 2023 23:27
@stacimc stacimc changed the title Add API healthcheck to data refresh Add index readiness check to data refresh Jun 15, 2023
@stacimc stacimc marked this pull request as ready for review June 16, 2023 00:03
@stacimc stacimc requested a review from AetherUnbound June 16, 2023 00:03
@sarayourfriend (Collaborator) left a review comment:

The API-related environment variables can be removed now, right?

catalog/dags/common/ingestion_server.py (outdated review thread, resolved)
catalog/dags/common/ingestion_server.py (outdated review thread, resolved)
catalog/tests/dags/common/test_ingestion_server.py (outdated review thread, resolved)
@sarayourfriend (Collaborator) left a review comment:

I've tested this locally and it works great, including the failure states.

One question: are we confident that index "readiness" can be sufficiently deduced from cluster health? I can't remember from the original incident whether the cluster was reporting green health even while it wasn't returning any search results.

I was also curious if the cluster would go yellow at the expected time, but to test that I had to have a green cluster locally. I adapted this docker-compose ES configuration to get a three-node cluster running locally with the diff below (collapsed to save space):

Multi-node local ES cluster with green health
diff --git a/docker-compose.yml b/docker-compose.yml
index 3304ac042..36d0b0bc1 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -177,30 +177,65 @@ services:
     ports:
       - "50263:6379"
 
-  es:
-    profiles:
-      - ingestion_server
-      - api
+  es01:
+    networks:
+      default:
+        aliases:
+          - es
     image: docker.elastic.co/elasticsearch/elasticsearch:7.12.0
+    container_name: es01
+    environment:
+      - node.name=es01
+      - cluster.name=openverse-local
+      - discovery.seed_hosts=es02,es03
+      - cluster.initial_master_nodes=es01,es02,es03
+      - bootstrap.memory_lock=true
+      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
+    depends_on:
+      - es02
+      - es03
+    ulimits:
+      memlock:
+        soft: -1
+        hard: -1
+    volumes:
+      - es-data01:/usr/share/elasticsearch/data
     ports:
       - "50292:9200"
-    env_file:
-      - docker/es/env.docker
-    healthcheck:
-      test:
-        [
-          "CMD-SHELL",
-          "curl -si -XGET 'localhost:9200/_cluster/health?pretty' | grep -qE 'yellow|green'",
-        ]
-      interval: 10s
-      timeout: 60s
-      retries: 10
+    
+  es02:
+    image: docker.elastic.co/elasticsearch/elasticsearch:7.12.0
+    container_name: es02
+    environment:
+      - node.name=es02
+      - cluster.name=openverse-local
+      - discovery.seed_hosts=es01,es03
+      - cluster.initial_master_nodes=es01,es02,es03
+      - bootstrap.memory_lock=true
+      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
     ulimits:
-      nofile:
-        soft: 65536
-        hard: 65536
+      memlock:
+        soft: -1
+        hard: -1
+    volumes:
+      - es-data02:/usr/share/elasticsearch/data
+    
+  es03:
+    image: docker.elastic.co/elasticsearch/elasticsearch:7.12.0
+    container_name: es03
+    environment:
+      - node.name=es03
+      - cluster.name=openverse-local
+      - discovery.seed_hosts=es01,es02
+      - cluster.initial_master_nodes=es01,es02,es03
+      - bootstrap.memory_lock=true
+      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
+    ulimits:
+      memlock:
+        soft: -1
+        hard: -1
     volumes:
-      - es-data:/usr/share/elasticsearch/data
+      - es-data03:/usr/share/elasticsearch/data
 
   web:
     profiles:
@@ -219,7 +254,7 @@ services:
       - "50230:3000" # Sphinx (unused by default; see `sphinx-live` recipe)
     depends_on:
       - db
-      - es
+      - es01
       - cache
     env_file:
       - api/env.docker
@@ -246,7 +281,7 @@ services:
     depends_on:
       - db
       - upstream_db
-      - es
+      - es01
       - indexer_worker
     volumes:
       - ./ingestion_server:/ingestion_server
@@ -272,7 +307,7 @@ services:
     depends_on:
       - db
       - upstream_db
-      - es
+      - es01
     volumes:
       - ./ingestion_server:/ingestion_server
     env_file:
@@ -300,6 +335,8 @@ volumes:
   catalog-postgres:
   plausible-postgres:
   plausible-clickhouse:
-  es-data:
+  es-data01:
+  es-data02:
+  es-data03:
   minio:
   catalog-cache:

When I run the reindex, it does go yellow at the time we expect here.

So my only question is whether it's still worth querying the new index directly to see if it reports hits matching/exceeding the threshold.
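For what it's worth, the direct query being discussed could be as small as this sketch (the client URL comes from the setup instructions; the index name is a placeholder):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://es:9200")

# Ask the hypothetical new index directly how many documents it can serve to search,
# and compare that against the expected threshold.
result = es.count(index="image-new")
print(result["count"])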

Also, as a follow-up to this change, are we able to remove exact_index after this or do we need to keep it for anything else?

I'll approve once the questions are cleared up and unneeded environment variables etc. are cleaned up (if we don't need them anymore).

@stacimc (Collaborator, Author) commented on Jun 16, 2023:

One question: are we confident that index "readiness" can be sufficiently deduced from cluster health? I can't remember from the original incident whether the cluster was reporting green health even while it wasn't returning any search results.
...
When I run the reindex, it does go yellow at the time we expect here.

I was under this impression from the incident report, and this seems like it should be the case from the Elasticsearch documentation. However, I went back through Slack just now to triple-check and saw a comment suggesting the cluster was green when it was returning no documents 😱

I've updated this to check for hits from the index instead. I don't think it's necessary to keep the cluster health check as well, if it indeed cannot be relied upon.

Also, as a follow-up to this change, are we able to remove exact_index after this or do we need to keep it for anything else?

I believe we can and I intend to open that as a follow-up PR, yes.

@AetherUnbound (Collaborator) commented on Jun 16, 2023:

It seems this needs the elasticsearch_production connection; do you mind adding the steps for defining that in the testing instructions? Additionally, we're going to have an Elasticsearch-type (rather than HTTP-type) connection added as part of #2371 -- should we distinguish this connection type in some way? Possibly elasticsearch_http_production?

@dhruvkb dhruvkb removed their request for review June 19, 2023 04:07
@stacimc (Collaborator, Author) commented on Jun 19, 2023:

@obulat Can you check that you recreated the catalog/.env file (rm catalog/.env && cp catalog/env.template catalog/.env && just down && just up) and created the data refresh pool? I always get weird behaviour when I forget either of those things, even if I try to rerun.

I opened #2434 to prevent the data refresh pool from going missing locally.

@sarayourfriend FWIW #2352 made it so that you don't need to set up the data_refresh pool in local development, which I rebased this branch specifically to pick up because it's so annoying :)

@obulat Some of the env variable names changed during development -- the instructions are up to date but maybe you had an outdated connection id? If recreating the catalog/.env file and just down && just up doesn't work, I'd also recommend running just init to ensure you have some data in the catalog & api dbs.

@obulat (Contributor) left a review comment:

@obulat Some of the env variable names changed during development -- the instructions are up to date but maybe you had an outdated connection id? If recreating the catalog/.env file and just down && just up doesn't work, I'd also recommend running just init to ensure you have some data in the catalog & api dbs.

I checked and re-checked the env variables and the data_refresh pool, but apparently missed something. Ran the commands @sarayourfriend listed to recreate the .env file, and everything ran very well! Excited to start running refreshes again 🚀

@sarayourfriend (Collaborator) commented:

@sarayourfriend FWIW #2352 made it so that you don't need to set up the data_refresh pool in local development, which I rebased this branch specifically to pick up because it's so annoying :)

!! Thanks for letting me know, I guess that makes the other issue moot 🙂

@sarayourfriend (Collaborator) commented:

@stacimc I just realised that we probably need the same check in filtered index creation for newly created filtered indexes, right?

@stacimc stacimc force-pushed the add/api-healthcheck-to-data-refresh branch from b67ff80 to 8ee2719 Compare June 20, 2023 22:03
@stacimc (Collaborator, Author) commented on Jun 20, 2023:

@stacimc I just realised that we probably need the same check in filtered index creation for newly created filtered indexes, right?

Very good call! Done.

@sarayourfriend (Collaborator) left a review comment:

Nice!

@stacimc stacimc merged commit 475f464 into main Jun 20, 2023
@stacimc stacimc deleted the add/api-healthcheck-to-data-refresh branch June 20, 2023 22:32
Labels: 💻 aspect: code, 🌟 goal: addition, 🟧 priority: high, 🧱 stack: api, 🧱 stack: catalog
Projects: Archived in project
Development

Successfully merging this pull request may close these issues.

Check API queries against the new index before promotion
6 participants