
Search relevancy sandbox #392

Closed · 16 tasks done · Tracked by #2251

obulat opened this issue Feb 10, 2023 · 26 comments
Labels: 🧭 project: thread (An issue used to track a project and its progress)

Comments

@obulat (Contributor)

obulat commented Feb 10, 2023

| Start Date | ETA | Project Lead | Actual Ship Date |
| --- | --- | --- | --- |
| 2023-04-01 | 2024-04-30 | @AetherUnbound | TBD |

Description

Modify the API staging environment to include a proportional subset of media from each provider. Increase the frequency of data refreshes.

This project does not address metrics for measuring and tracking relevancy.

To implement this project, we want to read the production /stats/ endpoints to get the media totals for each provider, then scale these numbers down to set the ingestion limit per provider in staging.
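As a rough illustration of that scaling step, here is a minimal sketch in Python, assuming the production stats endpoint at `/v1/{media_type}/stats/` (which lists each provider's `source_name` and `media_count`); `STAGING_FRACTION` and the function name are hypothetical, not the project's actual implementation:

```python
import requests

STAGING_FRACTION = 0.1  # hypothetical: keep roughly 10% of each provider's records

def staging_ingestion_limits(media_type: str = "images") -> dict[str, int]:
    """Read production media totals per provider and scale them down."""
    resp = requests.get(
        f"https://api.openverse.org/v1/{media_type}/stats/", timeout=30
    )
    resp.raise_for_status()
    # The stats endpoint returns one entry per provider, including its
    # `source_name` and total `media_count`.
    return {
        entry["source_name"]: max(1, round(entry["media_count"] * STAGING_FRACTION))
        for entry in resp.json()
    }

print(staging_ingestion_limits("images"))
```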

This project could also include updating the Elasticsearch cluster (or setting up a new cluster and moving staging to it).

Documents

Project homepage

Issues

Milestones

Prior Art

@obulat obulat added the 🧭 project: thread An issue used to track a project and its progress label Feb 10, 2023
@obulat obulat moved this to Not Started in Openverse Project Tracker Feb 16, 2023
@AetherUnbound AetherUnbound moved this from Not Started to In Kickoff in Openverse Project Tracker Mar 28, 2023
@AetherUnbound (Collaborator)

Update 2023-04-12

Done

Next

Blockers

@AetherUnbound (Collaborator)

AetherUnbound commented May 2, 2023

Update 2023-05-02

Done

Next

  • Create the issues associated with the work described in Implementation Plan: Update staging database #1154
  • Implement the issues created above
  • Implementation plan for the rapid iteration on ingestion server index configuration
  • Implementation plan for the Staging Elasticsearch reindex DAGs for both potential index types

@AetherUnbound (Collaborator)

Both the rapid iteration IP (#1985) and the staging database recreation DAG (#1989) are underway. Most of the implementation for the DAG is complete, with #2207 and #2211 being the final two pieces before that can be shipped. We'll also need to complete #1990 before we turn the DAG on for regular use.

@krysal krysal moved this from In RFC to In Progress in Openverse Project Tracker Jun 7, 2023
@AetherUnbound (Collaborator)

The staging database restore DAG is complete! 🥳 I will enable it for the first time once I'm back from WCEU next week.

@AetherUnbound (Collaborator)

AetherUnbound commented Jun 9, 2023

The implementation plan for the rapid iteration (#2133) has been merged and the associated issues (#2370, #2371, #2372) have been created. This work needs to be performed serially, but it can begin immediately!

The initial implementation plan for the 3rd portion of this project has also been published by @krysal in #2358.

@openverse-bot (Collaborator)

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@AetherUnbound (Collaborator)

The staging database restore DAG work was completed, and the DAG was enabled for the first time this week! While it completed successfully, @sarayourfriend encountered some trouble with the Terraform state referencing the RDS instance (the instance appeared to Terraform to have been removed, even though a new one had been spun up). Sara corrected the state file and we believe the issue is resolved, but we'll need to monitor the next run to see if we encounter the same issue and adjust either Terraform or the DAG accordingly.

@krysal opened the final IP for this work in #2358, and we've begun discussing it there.

@sarayourfriend (Collaborator)

> but we'll need to monitor the next run to see if we encounter the same issue and adjust either Terraform or the DAG accordingly.

After the subsequent runs, the changes in Terraform worked as expected; no further work is needed here.

@AetherUnbound AetherUnbound moved this from In Progress to On Hold in Openverse Project Tracker Jul 5, 2023
@AetherUnbound (Collaborator)

Per the priorities meeting discussion about this project, we'll be putting it on hold for now. We'll continue the review process to merge #2358, but hold off on further development in favor of other ongoing projects.

@AetherUnbound AetherUnbound moved this from On Hold to In Progress in Openverse Project Tracker Oct 4, 2023
@openverse-bot (Collaborator)

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@AetherUnbound (Collaborator)

We've just taken this project off hold: the Elasticsearch rapid iteration milestone issues can be worked on once we address API stability!

@openverse-bot (Collaborator)

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@AetherUnbound (Collaborator)

Work on this project has been slower due to our current focus on #3197. With that effort winding down, we should be able to start working on this project again this cycle.

@openverse-bot (Collaborator)

Hi @AetherUnbound, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@AetherUnbound (Collaborator)

The only update is that we now have a PR open for creating the staging indices: #3232

@stacimc (Collaborator)

stacimc commented Nov 17, 2023

While working on #3232, I went back and read the implementation plans for this project and came away with a lot of questions. It's possible some of these are answered in comment threads on the IP PRs that didn't make it into the final text, so apologies if I'm rehashing anything!

Here's a really brief summary of my understanding of the three IPs involved in this project, and the DAGs they describe:

  • Update staging DB
    • staging_database_restore DAG
      • This updates the staging DB using the most recent production DB snapshot. It does not affect ES indices.
  • Rapid iteration of ingestion server index configuration
    • create_new_es_index_staging
      • TL;DR: creates a brand new ES index, with lots of options for configuration, but does not actually promote the index/apply any aliases. (Question about this below.)
        • Longer explanation: For the given media_type, creates a new staging Elasticsearch index with the given index_suffix, derived from the given source_index, optionally filtering documents from the source using a given query. It also takes an index_config which is merged with the ES configuration of the source index (or totally overwrites it, if the override_config param is enabled), allowing you to change the shape of the index on the fly. (See the sketch after this list.)
    • create_new_es_index_production
      • The exact same thing as above, but for production.
  • Staging Elasticsearch Reindex DAGs
    • recreate_full_staging_index
      • Creates a new ES index in staging based on the <media_type> index, and points the <media_type>-full alias to it. If the point_alias param is enabled, it will instead point the main <media_type> alias to the new index. It also has a param that decides whether to delete the index being replaced.
    • create_proportional_by_provider_staging_index
      • Creates a new ES index in staging based on the given source_index, but with a proportional subset of records for each provider, given by the percentage_of_prod param.
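To make the summary above concrete, here is a rough sketch of the create-then-reindex pattern these index-creation DAGs share, written against the Elasticsearch 8.x Python client rather than as Airflow tasks. The parameter names mirror the ones described above, but everything else (defaults, merge behavior, the cluster URL) is an illustrative assumption, not the actual DAG code:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # the staging cluster, in practice

def create_new_es_index(
    source_index: str,
    index_suffix: str,
    index_config: dict | None = None,
    override_config: bool = False,
    query: dict | None = None,
    max_docs: int | None = None,
) -> str:
    """Create `{source_index}-{index_suffix}` and asynchronously reindex into it."""
    new_index = f"{source_index}-{index_suffix}"
    src = es.indices.get(index=source_index)[source_index]
    settings = dict(src["settings"]["index"])
    # Elasticsearch sets these itself; they can't be passed back at creation time.
    for key in ("uuid", "creation_date", "provided_name", "version"):
        settings.pop(key, None)
    mappings = src["mappings"]
    if index_config and override_config:
        settings = index_config.get("settings", settings)
        mappings = index_config.get("mappings", mappings)
    elif index_config:
        # Shallow merge: the provided overrides win over the source's config.
        settings = {**settings, **index_config.get("settings", {})}
        mappings = {**mappings, **index_config.get("mappings", {})}
    es.indices.create(index=new_index, settings=settings, mappings=mappings)
    source = {"index": source_index, **({"query": query} if query else {})}
    es.reindex(  # async: a real DAG would poll the tasks API for completion
        source=source,
        dest={"index": new_index},
        max_docs=max_docs,
        wait_for_completion=False,
    )
    return new_index
```

The `max_docs` cap on `_reindex` is also the piece a proportional DAG could apply once per provider (each call using a provider-filtering `query`) to build a scaled-down index.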

So here are my questions:

  1. We just updated the data refresh to create the filtered index before promoting the new media index, because we observed performance issues during filtered index creation (since it was running a reindex on a live, actively used index). How will the create_new_es_index_production DAGs account for this? They currently create indices from the main ES index.
  2. The create_new_es_index_{environment} DAGs create the indices, but don’t promote them/point any aliases. Shouldn't they? What is the plan for how to promote those indices — is that done manually, or in a separate DAG not yet planned?
    1. Could we not have a couple of params (see the alias sketch after this list):
      1. target_alias: the alias to apply to the new index. If empty, no alias is pointed.
      2. delete_old_index: whether to delete the index being replaced by this one, if applicable (i.e., the index previously pointed to by the target alias).
  3. What is the purpose of the recreate_full_staging_index DAGs? The create_new_es_index_staging does the same thing but also has many more configuration options. The exception is the option to promote/delete old indices, which I argue should be added to that DAG anyway. This seems like duplicated code to have to maintain.
  4. I do think it's a good idea to separate out the create_proportional_by_provider_staging_index DAG (rather than adding the percentage_of_prod param to the create_new_es_index_{environment} DAGs), especially since that param doesn't make sense in prod. But it seems like we could intentionally share a lot of logic around creating the index and the promotion steps (only changing the logic for the actual reindexing).
  5. Should we add an issue to create a staging_data_refresh DAG to run data refreshes on the staging ingestion server/against the staging API? That seems like it would be a really helpful final piece for testing (and easy to implement).
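Regarding question 2, here is a sketch of what the proposed target_alias / delete_old_index params could do, again assuming the Elasticsearch 8.x Python client; the function and its exact behavior are a proposal illustration, not existing code:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def promote_index(new_index: str, target_alias: str, delete_old_index: bool = False) -> None:
    """Atomically point `target_alias` at `new_index`; optionally drop the old index."""
    old_indices = []
    if es.indices.exists_alias(name=target_alias):
        old_indices = list(es.indices.get_alias(name=target_alias))
    actions = [{"remove": {"index": idx, "alias": target_alias}} for idx in old_indices]
    actions.append({"add": {"index": new_index, "alias": target_alias}})
    # A single update_aliases call is atomic, so readers never see the alias missing.
    es.indices.update_aliases(actions=actions)
    if delete_old_index:
        for idx in old_indices:
            if idx != new_index:
                es.indices.delete(index=idx)
```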

@AetherUnbound
Copy link
Collaborator

I'll try my best to answer the above; @krysal may also be able to provide some additional context!

> 1. We just updated the data refresh to create the filtered index before promoting the new media index, because we observed performance issues during filtered index creation (since it was running a reindex on a live, actively used index). How will the create_new_es_index_production DAGs account for this? They currently create indices from the main ES index.

Part of that move was prompted by the API response time investigation project, and the motivation was, as you mention, to reduce pressure on the live indices while the data refresh is happening. Since we were able to reduce response times through a combination of related media query simplification and the API ASGI implementation, I think we should be safe to target live indices for this DAG in the future. With the frequency of the data refresh (and the ease with which we could change the order of operations for it!), it made sense to make that change. For the more on-the-fly index generation that this new DAG is intended to enable, I'm not sure how we'd get around using the live indices (unless we decided to create a whole new index in order to create the index from that, in which case we should just create the index from the database with the new settings).

All that to say: with how infrequently this particular DAG is likely to run, and how much our Elasticsearch response time has improved, I think this should be safe to do now 🙂

> 2. The create_new_es_index_{environment} DAGs create the indices, but don’t promote them/point any aliases. Shouldn't they? What is the plan for how to promote those indices — is that done manually, or in a separate DAG not yet planned?

That's a great point; I like the idea of adding those new configuration options!

> 3. What is the purpose of the recreate_full_staging_index DAGs? The create_new_es_index_staging does the same thing but also has many more configuration options. The exception is the option to promote/delete old indices, which I argue should be added to that DAG anyway. This seems like duplicated code to have to maintain.

As I understand it, create_new_es_index_* only uses an existing index, whereas recreate_full_staging_index should pull records from the database and insert them into a new index, similar to the data refresh. I believe the same holds for the proportional DAG. The former is derived while the latter two are created fresh. With that in mind, I think it makes sense to keep them separate.
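To illustrate that distinction, here is a minimal sketch of the "created fresh" path: stream rows out of the API database and bulk-index them into a new index, rather than deriving the index from an existing one via `_reindex`. The table, columns, and connection string here are hypothetical:

```python
import psycopg2
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
conn = psycopg2.connect("dbname=openledger")  # hypothetical staging API database

def docs(cursor):
    # Stream rows and shape each one into a bulk-indexing action.
    cursor.execute("SELECT identifier, title, provider FROM image")
    for identifier, title, provider in cursor:
        yield {
            "_index": "image-full",
            "_id": identifier,
            "_source": {"title": title, "provider": provider},
        }

# A named (server-side) cursor avoids loading the whole table into memory.
with conn.cursor(name="fresh_index") as cur:
    helpers.bulk(es, docs(cur))
```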

> 4. I do think it's a good idea to separate out the create_proportional_by_provider_staging_index DAG (rather than adding the percentage_of_prod param to the create_new_es_index_{environment} DAGs), especially since that param doesn't make sense in prod. But it seems like we could intentionally share a lot of logic around creating the index and the promotion steps (only changing the logic for the actual reindexing).

@sarayourfriend also mentioned this in the third IP. I also support having separate DAGs for these, but as we're developing them, we should keep reuse in mind!

> 5. Should we add an issue to create a staging_data_refresh DAG to run data refreshes on the staging ingestion server/against the staging API? That seems like it would be a really helpful final piece for testing (and easy to implement).

That could be useful! Especially for testing changes to the data refresh process. One thing we'd want to add at the infrastructure level is an official DNS name for the staging data refresh server, because currently we only reference it by IP, which changes with every deployment.


Hopefully that helps! I'd love for some other folks to weigh in, but I think it might make sense to take some of these changes back to the IPs and update them appropriately once we've decided 😄

@krysal (Member)

krysal commented Nov 21, 2023

@stacimc You summarized the parts of these plans very well. The questions are quite valid, and I pretty much agree with the answers that @AetherUnbound has already given; I'll just add a few comments.

> What is the purpose of the recreate_full_staging_index DAGs? The create_new_es_index_staging does the same thing but also has many more configuration options. The exception is the option to promote/delete old indices, which I argue should be added to that DAG anyway. This seems like duplicated code to have to maintain.

> As I understand it, create_new_es_index_* only uses an existing index, whereas recreate_full_staging_index should pull records from the database and insert them into a new index, similar to the data refresh. I believe the same holds for the proportional DAG. The former is derived while the latter two are created fresh. With that in mind, I think it makes sense to keep them separate.

Exactly! The recreate_full_staging_index is planned mainly to decouple the full index creation from the Data Refresh process, and to use the resulting index as the source for the create_proportional_by_provider_staging_index and the create_new_es_index_{environment} DAGs.

Given the complexity of this project at the outset, it was decided to split the different ways of creating the indexes (by type, environment, and size with custom configurations) into different IPs, and therefore into different DAGs; it was more understandable and manageable that way. Now that the requirements are clearer and there is more familiarity with ES, we can see opportunities to merge the DAGs, but I'd wait until at least some of them are built before doing so: the many moving parts are still there, and it will be easier to test them separately. That's my opinion, though! Maybe others see it differently. I also believe the same DAGs should probably apply to both environments at the end of the day.

@stacimc (Collaborator)

stacimc commented Nov 28, 2023

Thanks @krysal and @AetherUnbound, that additional context makes a ton of sense :)

> The recreate_full_staging_index is planned mainly to decouple the full index creation from the Data Refresh process, and to use the resulting index as the source for the create_proportional_by_provider_staging_index and the create_new_es_index_{environment} DAGs.

That's a great summary; it totally clicked for me. I think the underlying IPs are great, but I was stuck on the higher-level picture of how these DAGs will practically be used and relate to one another. That's actually sort of split out into a separate, future project 😅 For now, it makes a lot of sense to just be really flexible in the implementation, as already described :)

@stacimc (Collaborator)

stacimc commented Feb 7, 2024

Update:

I've opened a PR for the proportional index DAG, which is almost ready; it mostly needs a final test and then a description and testing instructions.

This PR also happens to tackle much of the work needed for this issue to add a DAG for pointing aliases; once it's merged, that issue can be resolved very quickly.

While implementing the proportional DAG, I noticed a problem and filed a bug for it (#3761). Looking at it now, I actually think it might've been easier to implement the fixed version than the original idea, so I may update the proportional index DAG PR to include that fix as well.

Summary: there are only a few issues left in the milestone, the largest of which is almost fully implemented. The remaining issues should be tackled in the next week or so.

@stacimc (Collaborator)

stacimc commented Mar 6, 2024

The proportional index DAG was merged after a while in code review, including the fix for the #3761 bug that was filed while working on it! All that remains is the point alias DAG, which is in progress and a much smaller issue.

@stacimc (Collaborator)

stacimc commented Mar 11, 2024

A PR is open for the point alias DAG.

While working on it, I noticed that a few of the concurrency dependencies between the related Elasticsearch DAGs had not been set up properly, and this had not been caught in review. That is a huge pain point of having all these DAGs, and it is very easy to miss. I prototyped an idea for handling this in a more automated way and created an issue (#3891).

At minimum, the dependencies need to be fixed; ideally, that issue is implemented and the problem is solved going forward. My plan is to timebox the larger issue and fall back to manually fixing the dependencies if there are any issues with the more complex approach.
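The prototype behind #3891 isn't detailed in this thread, but for context, one standard Airflow technique for keeping related DAGs from clobbering each other is a shared single-slot pool assigned to every Elasticsearch-mutating task; the scheduler then serializes those tasks across DAGs. Pool and DAG names below are illustrative, and note that a pool serializes at the task level, not the DAG-run level:

```python
from datetime import datetime

from airflow.decorators import dag, task

# Create the pool once with a single slot, e.g.:
#   airflow pools set staging_es 1 "Serialize staging Elasticsearch work"
STAGING_ES_POOL = "staging_es"

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def recreate_full_staging_index_sketch():
    @task(pool=STAGING_ES_POOL)
    def reindex():
        """All ES-mutating tasks across the related DAGs share this pool."""
        ...

    reindex()

recreate_full_staging_index_sketch()
```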

@stacimc (Collaborator)

stacimc commented Mar 26, 2024

The final PR for this milestone has been merged! Moving this project to Shipped.

@stacimc stacimc moved this from 🚧 In Progress to 🚢 Shipped in Openverse Project Tracker Mar 26, 2024
@stacimc (Collaborator)

stacimc commented Mar 26, 2024

Per the project's stated success criteria, which is:

> This project can be considered a success once any maintainer can make adjustments to the staging API database & Elasticsearch index based on the requirements described above.

I think this could be considered a success. The process docs note that a project-specific retro could be held at this stage as well, but feedback related to this project has already been discussed at the two most recent retros.

Given this, I would like to propose that this project be moved to Success and the project thread closed. @WordPress/openverse-maintainers, what do you think (and maybe specifically @AetherUnbound, as the author of the proposal)? I can also wait and raise this question at the community meeting, if preferred!

@AetherUnbound (Collaborator)

I would be comfortable moving this to Success!

@stacimc (Collaborator)

stacimc commented Mar 27, 2024

Excellent! Moving to Success 🎉

@stacimc stacimc moved this from 🚢 Shipped to ✅ Success in Openverse Project Tracker Mar 27, 2024
@stacimc stacimc closed this as completed Mar 27, 2024
@github-project-automation github-project-automation bot moved this from ✅ Success to 🚢 Shipped in Openverse Project Tracker Mar 27, 2024
@AetherUnbound AetherUnbound moved this from 🚢 Shipped to ✅ Success in Openverse Project Tracker Apr 3, 2024