Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create filtered index before promoting primary index during data refresh #3303

Merged

Conversation

stacimc
Copy link
Collaborator

@stacimc stacimc commented Nov 1, 2023

Fixes

Fixes #2981 by @AetherUnbound

Description

This PR updates the data refresh DAG to run the steps to create the filtered index before promoting the primary index . The goal is to prevent API instability during data refreshes, potentially caused by the API using an ES index that is also being used as the source of a reindex (to create the filtered index).

To accomplish this, I refactored the create_filtered_index DAG factory to add a new factory that generates a create_filtered_index TaskGroup and a separate promote_filtered_index TaskGroup (that does the index readiness check, promotion, and deletion). The data refresh DAGs now, rather than triggering the separate create_filtered_index DAGs, use this factory method to create the taskgroups and add them directly into its own flow:

Screenshot 2023-11-01 at 2 23 31 PM

Collapsed view, showing that we complete ingest_upstream, wait for the index readiness check on the primary index, then create the filtered index before promoting anything, and only then do we promote the primary index followed by the filtered index.

Screenshot 2023-11-01 at 2 23 52 PM

Expanded view of the filtered index creation

Screenshot 2023-11-01 at 2 24 06 PM

Expanded view of the filtered index promotion

We keep the separate filtered index creation DAGs as these are still useful for running manually when testing filtered indices or updating the sensitive terms list. These DAGs now just reuse the same task groups and add the concurrency check. Because the DAG is only run manually, I removed the force parameter to break through the concurrency check.

Screenshot 2023-11-01 at 2 24 38 PM

New view of the create filtered index DAG

Testing Instructions

Test the data refreshes locally. Also test the filtered index DAGs: I tried running the DAG locally with origin_index_suffix empty (so it would use the current index) and destination_index_suffix set to foo. After the DAG completed I checked Elasticvue to see that I now had an audio-foo-filtered index replacing the old filtered index :)

Test that the concurrency checks work to prevent either a data refresh or filtered index dag from running at the same time as its counterpart. You can do this by triggering both at once in the shell:

$ just catalog/shell
> airflow dags trigger image_data_refresh && airflow dags trigger create_filtered_image_index

Try swapping the order of the two commands as well to test the opposite direction. (If the filtered index Dag starts second, it should fail immediately. If the data refresh starts second, it should wait for the filtered index dag. This behavior was not changed.)

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

Unfortunately if we don't do this, the task depdendency graph looks very messy, as a
dependency line is drawn from generate_index_suffix to every descendant task that
uses it.
@stacimc stacimc added 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Nov 1, 2023
@stacimc stacimc self-assigned this Nov 1, 2023
@stacimc stacimc requested review from a team as code owners November 1, 2023 22:05
@stacimc stacimc requested review from fcoveram, AetherUnbound and obulat and removed request for fcoveram November 1, 2023 22:05
@github-actions github-actions bot added the 🧱 stack: documentation Related to Sphinx documentation label Nov 1, 2023
Copy link

github-actions bot commented Nov 1, 2023

Full-stack documentation: https://docs.openverse.org/_preview/3303

Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

Changed files 🔄:

conditions`` section below).

The DAGs generated in this module are on a `None` schedule and are only triggered
manually. This is primarily useful in two cases: for testing changes to the filtered
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to have this functionality!

Copy link
Contributor

@obulat obulat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested following the instructions, and the steps work well. Thank you for excellent documentation, as always, @stacimc!

@openverse-bot
Copy link
Collaborator

Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR:

@AetherUnbound
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend1 days, this PR was ready for review 4 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s)2.

@stacimc, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

Copy link
Collaborator

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is excellent 🤩 I tested the DAGs locally and they run great, I'm glad to have the functionality all encapsulated within one DAG now rather than have it spread across too, while maintaining the filtered index DAG in case we need it. I have a few comments but no blocking objections - awesome work!

catalog/dags/data_refresh/create_filtered_index.py Outdated Show resolved Hide resolved
catalog/dags/data_refresh/create_filtered_index.py Outdated Show resolved Hide resolved
@stacimc stacimc merged commit 8b98f00 into main Nov 10, 2023
41 checks passed
@stacimc stacimc deleted the update/create-filtered-index-before-promoting-primary-index branch November 10, 2023 18:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: documentation Related to Sphinx documentation
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Wait to change all ES aliases until filtered index creation is complete
4 participants