Add index readiness check to data refresh #2209
Conversation
I'm testing this out locally right now, but wanted to surface something - the OAuth tokens we receive from the API aren't indefinite, right? I think they have an expiration time (my local one was […]).
This LGTM! I was able to run all the testing instructions locally successfully. I have a few comments, but nothing blocking. I do, however, think that the token piece will require a follow-up PR for managing the OAuth registration.
If an access token is invalid or expired, querying the media endpoints using internal__index will not throw an error: it will silently query against the live table. This could obviously be misleading in the health check.
This seems like behavior we'd want to change on the API; returning a 400 would probably be best in the case that a token is invalid for that particular parameter, especially given that the token we have will eventually expire.
poke_interval=poke_interval,
timeout=timeout.total_seconds(),
retries=retries,
retry_exponential_backoff=True,
TIL, very neat!
I may be wrong about this, but I think the current implementation incorrectly uses the API token/misses an important implementation detail.
To make authenticated requests to the API, a new token must be retrieved using the API credentials. These tokens expire after a set period of time. See the response body documented in https://api.openverse.engineering/v1/#tag/auth/operation/token, which includes expires_in. The token must be refreshed by making a new API token request each time. Even our own frontend does this token refresh dance:
openverse/frontend/src/plugins/api-token.server.ts
Lines 71 to 102 in 697f62f
const refreshApiAccessToken = async (
  clientId: string,
  clientSecret: string
) => {
  const formData = new URLSearchParams()
  formData.append("client_id", clientId)
  formData.append("client_secret", clientSecret)
  formData.append("grant_type", "client_credentials")

  const apiService = createApiService()
  try {
    const res = await apiService.post<TokenResponse>(
      "auth_tokens/token",
      formData
    )
    process.tokenData.accessToken = res.data.access_token
    process.tokenData.accessTokenExpiry = currTimestamp() + res.data.expires_in
  } catch (e) {
    /**
     * If an error occurs, serve the current request (and any pending)
     * anonymously and hope it works. By setting the expiry to 0 we queue
     * up another token fetch attempt for the next request.
     */
    error("Unable to retrieve API token, clearing existing token", e)
    process.tokenData.accessToken = ""
    process.tokenData.accessTokenExpiry = 0
    ;(e as AxiosError).message = `Unable to retrieve API token. ${
      (e as AxiosError).message
    }`
    throw e
  }
}
Based on what I've seen of how the AwsHook works from the AWS provider, I think a new Hook is the right way to manage that kind of thing. The hook would need the client ID and secret in order to make the token refresh request. Ideally the newest token would be stored somewhere in the database (or shared memory? Redis?) and reused if possible or refreshed before a request is sent.
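For illustration, here is a minimal sketch of what such a hook might look like. All names here are hypothetical, the token is cached in-process rather than in the database or Redis as suggested above, and the endpoint path is borrowed from the frontend snippet:

from datetime import datetime, timedelta

import requests
from airflow.hooks.base import BaseHook


class OpenverseApiHook(BaseHook):
    """Hypothetical hook that caches an API token and refreshes it on expiry."""

    def __init__(self, conn_id="openverse_api"):
        super().__init__()
        self.conn_id = conn_id
        self._token = None
        self._expires_at = datetime.min

    def get_token(self):
        # Refresh slightly before the reported expiry to avoid races.
        if self._token is None or datetime.utcnow() >= self._expires_at:
            self._refresh_token()
        return self._token

    def _refresh_token(self):
        # The Airflow connection is assumed to hold the client ID/secret.
        conn = self.get_connection(self.conn_id)
        res = requests.post(
            f"{conn.host}/v1/auth_tokens/token/",
            data={
                "client_id": conn.login,
                "client_secret": conn.password,
                "grant_type": "client_credentials",
            },
            timeout=10,
        )
        res.raise_for_status()
        data = res.json()
        self._token = data["access_token"]
        # Subtract a safety margin from expires_in before reusing the token.
        self._expires_at = datetime.utcnow() + timedelta(
            seconds=data["expires_in"] - 60
        )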
Two questions:
- Is it possible to derive index "readiness" from the Elasticsearch API? If so, it could be a more reliable and technically "correct" indicator than a rather generic threshold of responses to expect. This approach works, but it also required complicating the API by adding the internal__index param, which, if not needed, would be nice to remove to reduce the API complexity. Though, I don't know if that parameter is only needed for this, or if this is just one of several use cases for it. (See the sketch just after this list.)
- At a glance, I think these changes don't make Delay origin index alias promotion until after filtered index is created #2314 more complicated. Can you double-check to make sure that is the case? These changes, if anything, probably make it slightly easier, but I could be wrong about that and just want to have it in mind so as not to create any inadvertent snags for the other issue down the line.
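For reference, a rough sketch of what checking readiness against Elasticsearch's cluster health API directly could look like; the endpoint and index name here are assumptions for illustration:

import requests

# Assumed local ES endpoint; the health API can be scoped to a single index.
ES_URL = "http://localhost:9200"
INDEX = "image-init"  # hypothetical name of the newly created index

res = requests.get(f"{ES_URL}/_cluster/health/{INDEX}", timeout=10)
res.raise_for_status()
# Status is "green", "yellow", or "red". As discussed later in this thread,
# health alone may not guarantee the index is actually serving results.
print(res.json()["status"])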
Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR: @obulat. Excluding weekend days, this PR was ready for review 9 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s). @stacimc, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
Drafting to address concerns. I noted the issue with the token refresh and was hoping there was some kind of lever available to us already for bypassing that internally, or else that the automation could be moved to a follow-up :/ We can definitely generate/refresh the tokens in Airflow -- the existing oauth2 DAGs are a great place to start. Noting, though, that this is blocking the image data refresh in the meantime.
Really good question @sarayourfriend -- I'm not sure! I had not considered alternatives to the proposed implementation. I believe that is the only intended use of internal__index.
We could query the Elasticsearch index directly instead of going through the API -- this would eliminate the need for the internal__index param.
Also, doing a bit of research, I'm curious if what we need in the first place is to add the […]
The workers should ping […]. We shouldn't refresh during the bulk insert because the cost of each refresh can be offset until all documents are inserted and indexed (i.e., when refresh is called). It could be that somehow the […]
Querying ES directly sounds like an excellent alternative to avoid all the API auth issues though 👍
There is no workaround for this on the API side and I don't think it would be correct to implement tokens with infinite lifetimes. The potential for accidental misuse is too great.
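As a sketch of that alternative, a readiness probe could count documents in the new index directly, bypassing the API and its auth entirely. The endpoint, index name, and threshold below are illustrative, not the PR's exact implementation:

import requests

ES_URL = "http://localhost:9200"  # assumed local ES endpoint
INDEX = "image-init"              # hypothetical new index name
THRESHOLD = 1000                  # cf. ES_INDEX_READINESS_RECORD_COUNT

# _count only reflects documents already made visible by a refresh, so this
# naturally waits out the post-insert refresh discussed above.
res = requests.get(f"{ES_URL}/{INDEX}/_count", timeout=10)
res.raise_for_status()
print("index ready:", res.json()["count"] >= THRESHOLD)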
A small question here: is health_check appropriate, or does it cause too much confusion with our existing service-level healthchecks for the API, frontend, etc.? Would "index readiness check" or something potentially be more appropriate?
Force-pushed from 502a387 to 4725cb2
The API-related environment variables can be removed now, right?
I've tested this locally and it works great, including the failure states.
One question: are we confident that index "readiness" can be sufficiently deduced from cluster health? I can't remember from the original incident whether the cluster was reporting green health even while it wasn't returning any search results.
I was also curious if the cluster would go yellow at the expected time, but to test that I had to have a green cluster locally. I adapted this docker-compose ES configuration to get a three-node cluster running locally with the diff below (collapsed to save space):
Multi-node local ES cluster with green health
diff --git a/docker-compose.yml b/docker-compose.yml
index 3304ac042..36d0b0bc1 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -177,30 +177,65 @@ services:
ports:
- "50263:6379"
- es:
- profiles:
- - ingestion_server
- - api
+ es01:
+ networks:
+ default:
+ aliases:
+ - es
image: docker.elastic.co/elasticsearch/elasticsearch:7.12.0
+ container_name: es01
+ environment:
+ - node.name=es01
+ - cluster.name=openverse-local
+ - discovery.seed_hosts=es02,es03
+ - cluster.initial_master_nodes=es01,es02,es03
+ - bootstrap.memory_lock=true
+ - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
+ depends_on:
+ - es02
+ - es03
+ ulimits:
+ memlock:
+ soft: -1
+ hard: -1
+ volumes:
+ - es-data01:/usr/share/elasticsearch/data
ports:
- "50292:9200"
- env_file:
- - docker/es/env.docker
- healthcheck:
- test:
- [
- "CMD-SHELL",
- "curl -si -XGET 'localhost:9200/_cluster/health?pretty' | grep -qE 'yellow|green'",
- ]
- interval: 10s
- timeout: 60s
- retries: 10
+
+ es02:
+ image: docker.elastic.co/elasticsearch/elasticsearch:7.12.0
+ container_name: es02
+ environment:
+ - node.name=es02
+ - cluster.name=openverse-local
+ - discovery.seed_hosts=es01,es03
+ - cluster.initial_master_nodes=es01,es02,es03
+ - bootstrap.memory_lock=true
+ - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
ulimits:
- nofile:
- soft: 65536
- hard: 65536
+ memlock:
+ soft: -1
+ hard: -1
+ volumes:
+ - es-data02:/usr/share/elasticsearch/data
+
+ es03:
+ image: docker.elastic.co/elasticsearch/elasticsearch:7.12.0
+ container_name: es03
+ environment:
+ - node.name=es03
+ - cluster.name=openverse-local
+ - discovery.seed_hosts=es01,es02
+ - cluster.initial_master_nodes=es01,es02,es03
+ - bootstrap.memory_lock=true
+ - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
+ ulimits:
+ memlock:
+ soft: -1
+ hard: -1
volumes:
- - es-data:/usr/share/elasticsearch/data
+ - es-data03:/usr/share/elasticsearch/data
web:
profiles:
@@ -219,7 +254,7 @@ services:
- "50230:3000" # Sphinx (unused by default; see `sphinx-live` recipe)
depends_on:
- db
- - es
+ - es01
- cache
env_file:
- api/env.docker
@@ -246,7 +281,7 @@ services:
depends_on:
- db
- upstream_db
- - es
+ - es01
- indexer_worker
volumes:
- ./ingestion_server:/ingestion_server
@@ -272,7 +307,7 @@ services:
depends_on:
- db
- upstream_db
- - es
+ - es01
volumes:
- ./ingestion_server:/ingestion_server
env_file:
@@ -300,6 +335,8 @@ volumes:
catalog-postgres:
plausible-postgres:
plausible-clickhouse:
- es-data:
+ es-data01:
+ es-data02:
+ es-data03:
minio:
catalog-cache:
When I run the reindex, it does go yellow at the time we expect here.
So my only question is whether it's still worth querying the new index directly to see if it reports hits matching/exceeding the threshold.
Also, as a follow-up to this change, are we able to remove exact_index after this, or do we need to keep it for anything else?
I'll approve once the questions are cleared up and unneeded environment variables etc. are cleaned up (if we don't need them anymore).
I was under this impression from the incident report, and this seems like it should be the case from the Elasticsearch documentation. However, I went back through Slack just now to triple-check and saw a comment suggesting the cluster was green when it was returning no documents 😱 I've updated this to check for hits from the index instead. I don't think it's necessary to keep the cluster health check as well, if it indeed cannot be relied upon.
I believe we can, and I intend to open that as a follow-up PR, yes.
It seems this needs the […]
@sarayourfriend FWIW #2352 made it so that you don't need to set up the […]
@obulat Some of the env variable names changed during development -- the instructions are up to date, but maybe you had an outdated connection id? If recreating the catalog/.env file and just down && just up doesn't work, I'd also recommend running just init to ensure you have some data in the catalog & api dbs.
I checked and re-checked the env variables and the data_refresh pool, but apparently missed something. Ran the commands @sarayourfriend listed to recreate the .env file, and everything ran very well! Excited to start running refreshes again 🚀
!! Thanks for letting me know, I guess that makes the other issue moot 🙂
@stacimc I just realised that we probably need the same check in filtered index creation for newly created filtered indexes, right?
Force-pushed from b67ff80 to 8ee2719
Very good call! Done.
Nice!
Fixes
Fixes #2071 by @dhruvkb
Description
Adds an index_readiness_check task to the data refresh DAGs (including the create_filtered_<media>_index DAGs), used to detect issues where the index is ready, but the API is not connecting properly and returning results. The task is an HttpSensor which runs after ingest_upstream completes (creating the new ES index), and directly queries the new ES index, passing only when we retrieve a threshold number of hits. Once it passes, the DAG is able to move on to the promote steps and actually promote the indices in production.
[Screenshot: only the data refresh TaskGroup is shown.]
The Sensor times out in 24 hours, so the entire DAG will fail (without promoting the new index) and notify in Slack if the ES index does not become ready within that timeframe. After manual intervention, the DAG can safely be resumed manually from this point.
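For a rough idea of the shape of this task, here is a minimal sketch of an HttpSensor in reschedule mode. The DAG ID, connection ID, index name, poke interval, and threshold are illustrative, not the exact values in this PR:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.http.sensors.http import HttpSensor


def _check_hits(response, threshold=1000):
    # Pass only once the new index reports at least `threshold` hits.
    return response.json()["hits"]["total"]["value"] >= threshold


with DAG(
    dag_id="example_data_refresh",  # hypothetical DAG for the sketch
    start_date=datetime(2023, 1, 1),
    schedule=None,
) as dag:
    index_readiness_check = HttpSensor(
        task_id="index_readiness_check",
        http_conn_id="elasticsearch_http_production",
        endpoint="image-init/_search",  # hypothetical new index name
        method="GET",
        response_check=_check_hits,
        mode="reschedule",  # release the worker slot between pokes
        poke_interval=60 * 60,
        timeout=timedelta(hours=24).total_seconds(),
    )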
Testing Instructions
Setup
- Add ES_INDEX_READINESS_RECORD_COUNT=1000 to your catalog/.env. This is used in local testing to lower the threshold record count, since local test data does not have as many records.
- Set up the ES connection by adding the following to catalog/.env: AIRFLOW_CONN_ELASTICSEARCH_HTTP_PRODUCTION=http://es:9200
- Run just down && just up
- Enable the create_filtered_audio_index and create_filtered_image_index DAGs
Tests
- Set the ES_INDEX_READINESS_RECORD_COUNT record to a larger number (100_000). Then run the DAG and observe that the index_readiness_check task continuously goes up_for_reschedule and does not pass.
- Update the index_readiness_check task to query with an incorrect index name. Manually edit the task here: […]. Re-run the DAG and observe that the task continuously goes up_for_reschedule and does not pass.
Checklist
- My pull request has a descriptive title (not a vague title like Update index.md).
- My pull request targets the default branch of the repository (main) or a parent feature branch.
Developer Certificate of Origin