Create filtered index before promoting primary index during data refresh #3303
Unfortunately, if we don't do this, the task dependency graph looks very messy, as a dependency line is drawn from `generate_index_suffix` to every descendant task that uses it.
Full-stack documentation: https://docs.openverse.org/_preview/3303

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub Pages deployment action list to see the current status of the deployments.
conditions`` section below).

The DAGs generated in this module are on a `None` schedule and are only triggered
manually. This is primarily useful in two cases: for testing changes to the filtered
Nice to have this functionality!
I tested following the instructions, and the steps work well. Thank you for excellent documentation, as always, @stacimc!
Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR: @AetherUnbound

Excluding weekend days, this PR was ready for review 4 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s).

@stacimc, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
This is excellent 🤩 I tested the DAGs locally and they run great. I'm glad to have the functionality all encapsulated within one DAG now rather than have it spread across two, while maintaining the filtered index DAG in case we need it. I have a few comments but no blocking objections - awesome work!
Fixes
Fixes #2981 by @AetherUnbound
Description
This PR updates the data refresh DAG to run the steps to create the filtered index before promoting the primary index. The goal is to prevent API instability during data refreshes, potentially caused by the API using an ES index that is also being used as the source of a reindex (to create the filtered index).
To accomplish this, I refactored the `create_filtered_index` DAG factory to add a new factory that generates a `create_filtered_index` TaskGroup and a separate `promote_filtered_index` TaskGroup (that does the index readiness check, promotion, and deletion). The data refresh DAGs now, rather than triggering the separate `create_filtered_index` DAGs, use this factory method to create the task groups and add them directly into their own flow:

Collapsed view, showing that we complete `ingest_upstream`, wait for the index readiness check on the primary index, then create the filtered index before promoting anything, and only then do we promote the primary index followed by the filtered index.

Expanded view of the filtered index creation

Expanded view of the filtered index promotion
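The ordering described above can be sketched as a small dependency graph. This is not the PR's Airflow code: it only models the collapsed view with Python's standard-library `graphlib`, and the step names other than `ingest_upstream`, `create_filtered_index`, and `promote_filtered_index` are placeholders chosen for illustration.

```python
from graphlib import TopologicalSorter

# Each key maps a step to the set of steps that must finish before it.
# Assumption: step names are illustrative. Only ingest_upstream,
# create_filtered_index and promote_filtered_index appear in the PR text.
DATA_REFRESH_STEPS = {
    "ingest_upstream": set(),
    "primary_index_readiness_check": {"ingest_upstream"},
    "create_filtered_index": {"primary_index_readiness_check"},
    "promote_primary_index": {"create_filtered_index"},
    "promote_filtered_index": {"promote_primary_index"},
}


def execution_order(steps):
    """Return one valid execution order for a step -> prerequisites mapping."""
    return list(TopologicalSorter(steps).static_order())


if __name__ == "__main__":
    for step in execution_order(DATA_REFRESH_STEPS):
        print(step)
```

The point of the model is the middle of the chain: the filtered index is created while the primary index is still unpromoted, so the reindex source is never the index that the API is serving.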
We keep the separate filtered index creation DAGs, as these are still useful for running manually when testing filtered indices or updating the sensitive terms list. These DAGs now just reuse the same task groups and add the concurrency check. Because the DAG is only run manually, I removed the `force` parameter that was used to break through the concurrency check.

New view of the create filtered index DAG
Testing Instructions
Test the data refreshes locally.

Also test the filtered index DAGs: I tried running the DAG locally with `origin_index_suffix` empty (so it would use the current index) and `destination_index_suffix` set to `foo`. After the DAG completed, I checked Elasticvue to see that I now had an `audio-foo-filtered` index replacing the old filtered index :)

Test that the concurrency checks work to prevent either a data refresh or filtered index DAG from running at the same time as its counterpart. You can do this by triggering both at once in the shell. Try swapping the order of the two trigger commands as well to test the opposite direction. (If the filtered index DAG starts second, it should fail immediately. If the data refresh starts second, it should wait for the filtered index DAG. This behavior was not changed.)
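As a rough aid for checking the results of that test, the expected outcomes can be written down as a small decision function. This is a sketch of the behavior described above, not the repository's concurrency check: the `Outcome` enum, the `on_start` function, and the `"data_refresh"` / `"filtered_index"` labels are all hypothetical names.

```python
from enum import Enum


class Outcome(Enum):
    RUN = "run"    # nothing conflicting is running, so proceed
    WAIT = "wait"  # pause until the counterpart DAG finishes
    FAIL = "fail"  # abort immediately


def on_start(dag_kind: str, counterpart_running: bool) -> Outcome:
    """Return what a newly triggered DAG should do.

    dag_kind is "data_refresh" or "filtered_index" (hypothetical labels).
    counterpart_running says whether the other kind of DAG is already active.
    """
    if dag_kind not in ("data_refresh", "filtered_index"):
        raise ValueError(f"unknown DAG kind: {dag_kind!r}")
    if not counterpart_running:
        return Outcome.RUN
    if dag_kind == "filtered_index":
        # A filtered index DAG that starts second fails immediately.
        return Outcome.FAIL
    # A data refresh that starts second waits for the filtered index DAG.
    return Outcome.WAIT
```

For example, `on_start("filtered_index", True)` gives `Outcome.FAIL`, while `on_start("data_refresh", True)` gives `Outcome.WAIT`.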