Simplify related query to remove nesting and make more performant #3307

Merged
obulat merged 7 commits into main from simplify_related_query on Nov 9, 2023

Contributor

@obulat obulat commented Nov 2, 2023

Fixes

Fixes #3306 by @obulat

Description

This PR refactors the way the related query is built, using lower-level queries so that the resulting query is less nested.

The main changes in this PR are within the elasticsearch/related.py file. All other changes just move functions to different files: the related function moves from the search_controller module to a separate file within the elasticsearch folder, and some common pagination functions move to elasticsearch/helpers.py, to make the files easier to read.

Converting the simple search query to the terms query

After opening this PR, I also checked the queries logged by slowlog: all of the queries logged in recent hours were related queries.
Using the simple_query_string for the list of tags is not performant. This PR converts the query to use the terms query for tags. This means that tags will no longer match when the word form differs (so cat will not match cats), but since we are trying to match all of the tags, I think we should still have enough matches.
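
As a rough illustration of the change described above, the two clause shapes can be written as plain Python dicts (the JSON bodies sent to Elasticsearch). This is a sketch, not the code from this PR: the `tags.name` field comes from the diff, while the helper names, the sample tags, and the exact shape of the old `simple_query_string` clause are assumptions.

```python
def simple_query_string_clause(tags: list[str]) -> dict:
    """Full-text clause: the query text is parsed and analyzed before matching."""
    return {
        "simple_query_string": {
            "fields": ["tags.name"],
            # `|` is the OR operator in simple_query_string syntax.
            "query": " | ".join(tags),
        }
    }


def terms_clause(tags: list[str]) -> dict:
    """Term-level clause: the listed terms are not analyzed, so `cat` will not match `cats`."""
    return {"terms": {"tags.name": tags}}


# Hypothetical sample tags, for illustration only.
sample_tags = ["cat", "kitten"]
print(simple_query_string_clause(sample_tags))
print(terms_clause(sample_tags))
```

The terms clause skips query parsing and analysis entirely, which is the trade-off mentioned above: cheaper to run, but word forms must match exactly.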

Testing Instructions

Go to the /admin endpoint and set hide content for either Stocksnap or Flickr to true (in ContentModel).
Add logging to the related endpoint (s.to_dict() to view the ES query), and check that the query sent to ES when the related endpoint is requested does not have the nesting described in the issue.
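
One way to check the logged query for nesting is to count how deeply `bool` clauses are nested in the dict returned by `s.to_dict()`. The helper below is a hypothetical sketch using only the standard library; the sample queries and the `provider` field name are assumptions for illustration, not the actual queries from the issue.

```python
def bool_depth(query: object) -> int:
    """Return the maximum number of `bool` clauses nested inside one another."""
    if isinstance(query, dict):
        return max(
            (
                (1 if key == "bool" else 0) + bool_depth(value)
                for key, value in query.items()
            ),
            default=0,
        )
    if isinstance(query, list):
        return max((bool_depth(item) for item in query), default=0)
    return 0


# Hypothetical query bodies, for illustration only.
flat_query = {
    "query": {
        "bool": {
            "should": [{"terms": {"tags.name": ["cat"]}}],
            "must_not": [{"terms": {"provider": ["flickr"]}}],
        }
    }
}
nested_query = {
    "query": {
        "bool": {
            "must_not": [{"bool": {"should": [{"term": {"provider": "flickr"}}]}}]
        }
    }
}
print(bool_depth(flat_query), bool_depth(nested_query))
```

In the endpoint itself this could be logged as `bool_depth(s.to_dict())`, assuming `s` is the search object mentioned above.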

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@obulat obulat requested a review from a team as a code owner November 2, 2023 09:00
@github-actions github-actions bot added the 🧱 stack: api Related to the Django API label Nov 2, 2023
@openverse-bot openverse-bot added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository labels Nov 2, 2023
@obulat obulat force-pushed the simplify_related_query branch 2 times, most recently from b6ed3e7 to a5bb1a5 Compare November 3, 2023 04:12
@obulat obulat changed the title Simplify related query to remove nesting Simplify related query to remove nesting and make more performant Nov 6, 2023
@obulat obulat force-pushed the simplify_related_query branch 2 times, most recently from f88b556 to bc8934c Compare November 6, 2023 07:44
@obulat obulat changed the base branch from main to fix/timing-of-es-queries November 6, 2023 07:44
Base automatically changed from fix/timing-of-es-queries to main November 7, 2023 15:21
@obulat obulat force-pushed the simplify_related_query branch from bc8934c to b1e2e52 Compare November 7, 2023 15:24
@openverse-bot
Collaborator

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@sarayourfriend
@stacimc
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend[1] days, this PR was ready for review 3 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)[2].

@obulat, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

Collaborator

@sarayourfriend sarayourfriend left a comment

@obulat could you add some new unit tests, preferably ones that run against the previous implementation and then still pass with the new one? The only test for related I can find is that it returns a 200 (https://github.com//WordPress/openverse/blob/bc7b2a7f6ee32fc3ca367baae7681abc41ef7f79/api/test/media_integration.py#L152-L155) but it doesn't confirm anything about the query actually doing what we want it to do (for example, by testing that the query result is the expected one).

As it stands, this might work fine (I haven't tested it locally yet, the code looks fine) but I'm uncomfortable approving it without any tests that would verify the behaviour of the query is the same (or at least, changing in an expected way).

Also: I'd like to have a deployment plan for this that includes setting up a dashboard to monitor the changes in response times for the related endpoint. I've saved a Logs Insights query in CloudWatch called "parsed requests" (something along those lines) that exposes an isRelated flag, which we could use to create a query like this:

# ... saved query with parsing and creation of the `isRelated` field
| filter isRelated
| stats avg(upstream_response_time) by bin(5m)
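
For readers unfamiliar with Logs Insights, the aggregation in that query can be sketched in plain Python. This is only an illustration of what the two pipeline stages compute: `isRelated` and `upstream_response_time` come from the query above, while the `timestamp` key (assumed to be epoch seconds) and the sample records are hypothetical.

```python
from collections import defaultdict


def avg_by_5m_bin(records: list[dict]) -> dict[int, float]:
    """Average `upstream_response_time` of related requests per 5-minute bin."""
    bins: dict[int, list[float]] = defaultdict(list)
    for record in records:
        # Equivalent of `| filter isRelated`.
        if not record.get("isRelated"):
            continue
        # Equivalent of `bin(5m)`: round the timestamp down to a 300-second boundary.
        bin_start = record["timestamp"] - record["timestamp"] % 300
        bins[bin_start].append(record["upstream_response_time"])
    # Equivalent of `stats avg(upstream_response_time)`.
    return {start: sum(times) / len(times) for start, times in sorted(bins.items())}


# Hypothetical parsed log records, for illustration only.
sample_records = [
    {"timestamp": 0, "isRelated": True, "upstream_response_time": 1.0},
    {"timestamp": 299, "isRelated": True, "upstream_response_time": 3.0},
    {"timestamp": 300, "isRelated": True, "upstream_response_time": 5.0},
    {"timestamp": 10, "isRelated": False, "upstream_response_time": 100.0},
]
print(avg_by_5m_bin(sample_records))
```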

I've started this work in https://github.com/WordPress/openverse-infrastructure/issues/651 but likely won't be able to get to it until my Friday. It seems to me to be a prerequisite for any significant changes we make to the related endpoint.

Collaborator

@AetherUnbound AetherUnbound left a comment

Agreed with @sarayourfriend - I like the code changes and they make sense. Having tests for this would be useful, and having a dashboard for monitoring production when deployed feels necessary. I appreciate how much more readable the search building code is with all the changes you've been making @obulat!

api/api/controllers/elasticsearch/related.py Outdated
related_query["should"].append(Q("terms", tags__name=tags))

# Exclude the dynamically disabled sources.
if excluded_providers_query := get_excluded_providers_query():
Collaborator

Love that we have this now!

@sarayourfriend
Collaborator

Regarding the dashboard: I actually have a few minutes now, so I think I can have the basic thing up and we can iterate on it after deployment if needed. The data will be there before/after because it's just derived from the logs anyway, so the dashboard doesn't need to be perfect at deployment time, just comprehensible enough that we can monitor the deployment and make sure this change doesn't somehow negatively affect the response times (which, to be clear, I don't think will be the case, just want to make sure we can move forward with confidence!).

@sarayourfriend
Collaborator

Update on the dashboard: I implemented it earlier today, and it's available at a link in the issue: https://github.com/WordPress/openverse-infrastructure/issues/651

With unit tests this will be good-to-go from my perspective 😁

Signed-off-by: Olga Bulat <[email protected]>
@obulat obulat force-pushed the simplify_related_query branch from c5394b2 to 91ae7fe Compare November 8, 2023 13:09
@obulat
Contributor Author

obulat commented Nov 8, 2023

Thank you so much for setting up the dashboard, @sarayourfriend!

I've updated this PR so that the first commit is adding the test for related_media: b312c5e
The following commits add the changes, and the last commit updates the related test.

I've added assertions for the ES query to check that the correct query is set. It is not very consistent with other tests, because instead of using assert, it uses the .json in pook request setup.

I wanted to add a check that the results' title and tags have common words, but realized that it's not possible with this implementation since we use mocked results instead of the "related" results. I will see if I can use the integrated tests for this.

@obulat obulat force-pushed the simplify_related_query branch from 91ae7fe to 6d91d59 Compare November 8, 2023 13:24
@obulat obulat force-pushed the simplify_related_query branch from 6d91d59 to 568f754 Compare November 8, 2023 13:28
@obulat obulat requested a review from sarayourfriend November 8, 2023 13:41
@obulat obulat requested a review from AetherUnbound November 8, 2023 13:41
@sarayourfriend
Collaborator

I've added assertions for the ES query to check that the correct query is set. It is not very consistent with other tests, because instead of using assert, it uses the .json in pook request setup.

For what it's worth, I think that's fine, there are other ES tests that do this:

https://github.com//WordPress/openverse/blob/95b59110eff953f7d30d9b9444791ae2e0dc6d68/api/test/unit/controllers/test_search_controller.py#L584-L591

I wanted to add a check that the results' title and tags have common words, but realized that it's not possible with this implementation since we use mocked results instead of the "related" results. I will see if I can use the integrated tests for this.

You're right, I think it's only at the integration test level that we can actually test this in a meaningful way.

@obulat
Contributor Author

obulat commented Nov 8, 2023

For what it's worth, I think that's fine, there are other ES tests that do this:

I think this example is a little bit different from what I did in the test here. In the linked example, there's only 1 .json, which is used to set the response. In this PR, however, .json is used twice: first for making sure that the request body matches the shape we expect, and second to mock the response:

    mock_related = (
        pook.post(es_filtered_index_endpoint)
 >      .json(es_related_query)  # Testing that ES query is correct 
        .times(1)
        .reply(200)
        .header("x-elastic-product", "Elasticsearch")
 >      .json(mock_es_response)
        .mock
    )

@sarayourfriend
Collaborator

Sure, but the example I showed just uses a regex to test the body of the request. There's no real difference, the end result is pook either compares the body as strings using == (.json just tells pook to serialise the argument to a JSON string, it then compares the bodies as strings), or executes a regex test. In other words, the tests are really the same, the example I shared just tests a part of the body, whereas passing .json on the pook request tests the whole body.
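
The distinction between the two styles can be sketched with the standard library alone. This is not pook's implementation, only a hypothetical stand-in for the two kinds of matcher being discussed: one that checks the whole request body, and one that checks a fragment of it with a regex. The helper names and the sample body are assumptions.

```python
import json
import re


def matches_whole_body(expected: dict, actual_body: str) -> bool:
    """Whole-body check: the request must be exactly the expected JSON document."""
    return json.loads(actual_body) == expected


def matches_part_of_body(pattern: str, actual_body: str) -> bool:
    """Partial check: only a fragment of the request body has to match."""
    return re.search(pattern, actual_body) is not None


# Hypothetical request body, for illustration only.
expected_query = {"query": {"bool": {"should": [{"terms": {"tags.name": ["cat"]}}]}}}
request_body = json.dumps(expected_query)

print(matches_whole_body(expected_query, request_body))
print(matches_part_of_body(r'"tags\.name"', request_body))
```

The whole-body check fails as soon as any part of the query changes, while the regex check only guards the fragment it names, which is why the two suit different testing needs.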

Anyway, it's all good, both are perfectly fine approaches as far as I'm concerned and well suited for different testing needs.

@obulat
Contributor Author

obulat commented Nov 8, 2023

I added a check that the related results have at least one word in common with the main item (in the title and tags) in 8a5e3e4
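
A check of this kind could look roughly like the sketch below. It is not the test added in that commit: the helper names and sample items are hypothetical, and the item shape (a `title` string plus a `tags` list of `{"name": ...}` dicts) is an assumption based on the `tag["name"]` access in the diff.

```python
import re


def item_words(item: dict) -> set[str]:
    """Collect the lowercased words from an item's title and tag names."""
    tag_names = [tag["name"] for tag in item.get("tags") or []]
    text = " ".join([item.get("title") or "", *tag_names])
    return set(re.findall(r"\w+", text.lower()))


def shares_word(main_item: dict, related_item: dict) -> bool:
    """True if the two items have at least one word in common."""
    return bool(item_words(main_item) & item_words(related_item))


# Hypothetical items, for illustration only.
main_item = {"title": "Black Cat", "tags": [{"name": "animal"}]}
related_item = {"title": "Sleeping dog", "tags": [{"name": "Animal"}]}
unrelated_item = {"title": "Mountain", "tags": []}

assert shares_word(main_item, related_item)
assert not shares_word(main_item, unrelated_item)
```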

@obulat obulat closed this Nov 8, 2023
@obulat obulat reopened this Nov 8, 2023
Collaborator

@sarayourfriend sarayourfriend left a comment

LGTM! Excited to see how/if the endpoint timings respond 🤞

Collaborator

@AetherUnbound AetherUnbound left a comment

Thanks for adding the new tests, I'm excited to try this out!

@obulat obulat merged commit 7bb4298 into main Nov 9, 2023
58 of 80 checks passed
@obulat obulat deleted the simplify_related_query branch November 9, 2023 02:16
# Only use the first 10 tags
if tags:
    tags = [tag["name"] for tag in tags[:10]]
    related_query["should"].append(Q("terms", tags__name=tags))
Collaborator

Note for follow up @obulat:

Suggested change
- related_query["should"].append(Q("terms", tags__name=tags))
+ related_query["should"].append(Q("terms", tags__name__keyword=tags))
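
For context on the suggestion: elasticsearch-dsl turns double underscores in keyword arguments into dots, so the two calls target the fields `tags.name` and `tags.name.keyword`. The sketch below mimics only that mapping with a hypothetical helper, without importing the library. Whether the index mapping actually defines a `keyword` subfield on `tags.name` is an assumption implied by the suggestion, not something confirmed in this thread.

```python
def terms_query(field_kwarg: str, values: list[str]) -> dict:
    """Mimic how a `Q("terms", some__field=values)` keyword maps onto a dotted field path."""
    return {"terms": {field_kwarg.replace("__", "."): values}}


# Hypothetical sample tags, for illustration only.
sample_tags = ["Black Cat"]
print(terms_query("tags__name", sample_tags))
print(terms_query("tags__name__keyword", sample_tags))
```

Because a terms query does not analyze its input, running it against an analyzed text field compares the raw tag with the indexed tokens, whereas a keyword field stores each value verbatim and can match the tag as a whole.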

Contributor Author

See #3346

@sarayourfriend
Collaborator

Just wanted to share an update. We deployed this an hour ago and the related image response times are dramatically improved:
image

Nice work @obulat!

Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: api Related to the Django API
Development

Successfully merging this pull request may close these issues.

Excluded providers clause in the related query is inefficient
4 participants