-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementation Plan: Staging Elasticsearch Reindex DAGs #2358
Conversation
720ae1e
to
a65144a
Compare
54e14d4
to
d6cf421
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM so far, I left some clarifying questions, some suggestions, and some potential alternatives. But this looks great so far 🙂
.../search_relevancy_sandbox/20230530-implementation_plan_staging_elasticsearch_reindex_dags.md
Outdated
Show resolved
Hide resolved
.../search_relevancy_sandbox/20230530-implementation_plan_staging_elasticsearch_reindex_dags.md
Outdated
Show resolved
Hide resolved
.../search_relevancy_sandbox/20230530-implementation_plan_staging_elasticsearch_reindex_dags.md
Outdated
Show resolved
Hide resolved
.../search_relevancy_sandbox/20230530-implementation_plan_staging_elasticsearch_reindex_dags.md
Outdated
Show resolved
Hide resolved
.../search_relevancy_sandbox/20230530-implementation_plan_staging_elasticsearch_reindex_dags.md
Outdated
Show resolved
Hide resolved
"query": { | ||
"term": { | ||
"source.keyword": "stocksnap" | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you tried this locally to see how the resulting query orders the documents? For these proportional indexes, I wonder if using the random score function would help us get a statistically significant sample 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I try it locally but didn't pay attention to the order, I didn't think it was important. How would you define a "statistically significant sample" for this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
statistically significant sample
This isn't the term I should have used. More accurately would be "more diverse sample". If an individual provider's document's scores are consistent, then we'd be testing the same 10% (or whatever proportion we use) each time the reindex happens. Using a random score would make it more likely to have a diverse set of document configurations (file types, sizes, etc). It may not be an issue if the default scores are already sufficiently random and distribute attribute variation proportionally. Random score would theoretically make it more likely for the subset of documents to be representative of the whole if the default scores are uniform (which I don't know).
.../search_relevancy_sandbox/20230530-implementation_plan_staging_elasticsearch_reindex_dags.md
Outdated
Show resolved
Hide resolved
.../search_relevancy_sandbox/20230530-implementation_plan_staging_elasticsearch_reindex_dags.md
Outdated
Show resolved
Hide resolved
.../search_relevancy_sandbox/20230530-implementation_plan_staging_elasticsearch_reindex_dags.md
Show resolved
Hide resolved
Elasticsearch does not impose any limit on the amount of indices one can create | ||
but naturally they come with a cost. We don't have policies for creating or | ||
deleting indices for the time being so we should monitor if we reach a point | ||
where having many indexes impact the cluster performance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add a disk usage monitor for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking more about a scenario where having many indices degrades the overall cluster response time. I read something around these lines in the Elastic forum. But that is a good suggestion as well. I believe something similar is included in the ECS alarms proposal, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EC2 instances need the same treatment that the catalogue got to add the CloudWatch agent to get disk space reporting: https://github.com/WordPress/openverse-infrastructure/pull/455
The bigger question, I suppose, is how we would measure/notice degraded staging performance, or if we would need to at all. The worst case is: we notice performance degrades through regular usage and then delete some indexes. Do we need any special monitoring at all, in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would assume not, given that "regular usage" for staging is primarily our own interactions with it rather than direct user interactions.
I will be looking at this IP this week! Sorry for the delay due to my AFK. |
@krysal -- I'm happy to review, but just clarifying whether I'm a required reviewer on this, as I'm not named in the PR? |
I'll address @sarayourfriend's comments today! Sorry for the delay, other urgent matters took precedence in my calendar these days. @AetherUnbound No worries at all! @stacimc Yours was an automated assignation (probably because Sara isn't in the @WordPress/openverse-catalog group) but you're welcome to comment as well if you want. I am sure your ideas will improve the proposed plan :) |
d6cf421
to
b7900a6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for drafting this Krystle, and apologies it took me so long to review! I'm in agreement with Sara on a few points of clarification, along with some additional explicit steps being added.
.../search_relevancy_sandbox/20230530-implementation_plan_staging_elasticsearch_reindex_dags.md
Outdated
Show resolved
Hide resolved
.../search_relevancy_sandbox/20230530-implementation_plan_staging_elasticsearch_reindex_dags.md
Outdated
Show resolved
Hide resolved
.../search_relevancy_sandbox/20230530-implementation_plan_staging_elasticsearch_reindex_dags.md
Outdated
Show resolved
Hide resolved
|
||
```json | ||
{ | ||
"max_docs": num_items, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely, this rocks!
.../search_relevancy_sandbox/20230530-implementation_plan_staging_elasticsearch_reindex_dags.md
Outdated
Show resolved
Hide resolved
.../search_relevancy_sandbox/20230530-implementation_plan_staging_elasticsearch_reindex_dags.md
Show resolved
Hide resolved
0894743
to
37b07f1
Compare
Full-stack documentation: https://docs.openverse.org/_preview/2358 Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub pages deployment action list to see the current status of the deployments. |
Ah, that's a really important point you've raised @sarayourfriend. I don't think any of our current plans cover the filtered indices in staging actually 😮 That might be because both this project and the filtering of results on the API were being proposed alongside each other, so the latter hadn't been finalized while we were planning out the former. As it stands, I don't believe we have an explicit mechanism defined for creating the filtered index on staging. Given that this is not production though, perhaps it would make sense to filter the results while building the We've had a lot of discussion about this IP, I wonder if it might make sense to merge it as-is given it's quite close, with a fast-follow clarification about how we'd manage the filtering (given we won't be actually working on the implementation of this for a little bit since the project is on hold). Would you prefer that Krystle? And as for the automatic deletion, those are great points. I don't think we need to incorporate that aspect of it in the proposal here - we can leave it as an MSR task or something similar! |
@AetherUnbound The current way of building the filtered index is using the ES's Reindex API, it's not exactly a direct process form from the database. That would require creating a new process but I think we can just trigger the DAG for creating the filtered index as an additional step for building the
Agree on merging this, as the filtered index is a requirement that was not initially included. I'll change the default source to the filtered index for the subset-by-provider DAG, as it's an easy change. Thank you both for the excellent suggestions. |
Sounds good! Yes unfortunately the filtered index DAG is only enabled for production, so we'd need to either have a similar DAG factory for it or set the environment as a DAG parameter (though I think our team preference is for the former to prevent accidental production operations). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing, thanks for all our hard work on this!
.../search_relevancy_sandbox/20230530-implementation_plan_staging_elasticsearch_reindex_dags.md
Show resolved
Hide resolved
6159b34
to
f2224a5
Compare
Co-authored-by: sarayourfriend <[email protected]>
Co-authored-by: Madison Swain-Bowden <[email protected]>
f2224a5
to
75fb276
Compare
Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR: @stacimc Excluding weekend1 days, this PR was ready for review 6 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s)2. @krysal, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings. Footnotes
|
Thank you all! |
Fixes
Resolves #1987 by @AetherUnbound
Assigned reviewers
Current round
The discussion is currently in the Revision round.
The deadline for review of this round is 2023-06-27.