Simplify related query to remove nesting and make more performant #3307

obulat · 2023-11-02T09:00:40Z

Fixes

Description

This PR refactors how the related query is built, using lower-level queries to make it less nested.

The main changes for this PR are within the elasticsearch/related.py file. All other changes just move functions to different files. The related function is moved from the search_controller module to a separate file within the elasticsearch folder, and some other common functions for pagination are moved to the elasticsearch/helpers.py file, to make the files easier to read.

Converting the simple search query to the terms query

After opening this PR, I also checked the queries logged by slowlog: all of the queries logged in the recent hours were related queries.
Using the simple_query_string for the list of tags is not performant. This PR converts the query to use the terms query for tags. It will mean that the tags will not match if the form is different (so, cat will not match cats), but since we are trying to match all of the tags, I think we should still have enough matches.
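
As a rough illustration of the change described above, the two clauses can be written as plain Elasticsearch request bodies. This is a sketch, not the code in this PR: the helper names are hypothetical, the `tags.name` field is inferred from the `tags__name` keyword used in the review snippets, and the shape of the old `simple_query_string` clause (tags joined with the `|` OR operator) is an assumption.

```python
def simple_query_string_clause(tags):
    """Old style (assumed shape): tags joined into one analyzed query string."""
    return {
        "simple_query_string": {
            "fields": ["tags.name"],
            # "|" is the OR operator in simple_query_string syntax.
            "query": " | ".join(tags),
        }
    }


def terms_clause(tags):
    """New style: exact match against any of the listed tag names."""
    return {"terms": {"tags.name": tags}}


tags = ["cat", "garden"]
old_clause = simple_query_string_clause(tags)
new_clause = terms_clause(tags)
```

Because a `terms` clause compares values exactly, `cat` in the list would not match a document tagged only `cats`, which is the trade-off noted above.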

Testing Instructions

Go to the /admin endpoint and set hide content for either Stocksnap or Flickr to true (in ContentModel).
Add logging to the related endpoint (s.to_dict() to view the ES query), and check that the query sent to ES when the related endpoint is requested does not have the nesting described in the issue.
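
Since `s.to_dict()` returns a plain dictionary, the nesting can also be checked programmatically instead of by eye. The helper below is a hypothetical sketch and is not part of this PR; the two example query bodies are made up for illustration.

```python
def max_bool_depth(node, depth=0):
    """Return the deepest level of nested "bool" clauses in a query dict."""
    if isinstance(node, dict):
        deepest = depth
        for key, value in node.items():
            # Entering a "bool" clause adds one level of nesting.
            next_depth = depth + 1 if key == "bool" else depth
            deepest = max(deepest, max_bool_depth(value, next_depth))
        return deepest
    if isinstance(node, list):
        return max((max_bool_depth(item, depth) for item in node), default=depth)
    return depth


# Hypothetical examples of a nested and a flat query body.
nested = {"query": {"bool": {"must": [{"bool": {"should": [{"bool": {"must": []}}]}}]}}}
flat = {"query": {"bool": {"should": [{"terms": {"tags.name": ["cat"]}}]}}}

print(max_bool_depth(nested), max_bool_depth(flat))
```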

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main) or a parent feature branch.
My commit messages follow best practices.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.
I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

openverse-bot · 2023-11-08T00:00:13Z

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@sarayourfriend
@stacimc
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend¹ days, this PR was ready for review 3 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)².

@obulat, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

¹ Specifically, Saturday and Sunday.
² For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

api/test/unit/controllers/test_search_controller.py

sarayourfriend

@obulat could you add some new unit tests, preferably ones that run against the previous implementation and then still pass with the new one? The only test for related I can find is that it returns a 200 (https://github.com//WordPress/openverse/blob/bc7b2a7f6ee32fc3ca367baae7681abc41ef7f79/api/test/media_integration.py#L152-L155) but it doesn't confirm anything about the query actually doing what we want it to do (for example, by testing that the query result is the expected one).

As it stands, this might work fine (I haven't tested it locally yet, the code looks fine) but I'm uncomfortable approving it without any tests that would verify the behaviour of the query is the same (or at least, changing in an expected way).

Also: I'd like to have a deployment plan for this that includes setting up a dashboard to monitor the changes in response times for the related endpoint. I've saved a Logs Insights query in CloudWatch called "parsed requests" (something along those lines) that exposes an isRelated flag, which we could use to create a query like this:

# ... saved query with parsing and creation of `isRelated` field
| filter isRelated
| stats avg(upstream_response_time) by bin(5m)

I've started this work for https://github.com/WordPress/openverse-infrastructure/issues/651 but likely won't be able to get to it until my Friday. It seems to me to be a prerequisite for any significant changes we make to the related endpoint.

AetherUnbound

Agreed with @sarayourfriend - I like the code changes and they make sense. Having tests for this would be useful, and having a dashboard for monitoring production when deployed feels necessary. I appreciate how much more readable the search building code is with all the changes you've been making @obulat!

api/api/controllers/elasticsearch/related.py

AetherUnbound · 2023-11-08T00:46:55Z

+            related_query["should"].append(Q("terms", tags__name=tags))
+
+    # Exclude the dynamically disabled sources.
+    if excluded_providers_query := get_excluded_providers_query():


Love that we have this now!

sarayourfriend · 2023-11-08T00:50:01Z

Regarding the dashboard: I actually have a few minutes now, so I think I can have the basic thing up and we can iterate on it after deployment if needed. The data will be there before/after because it's just derived from the logs anyway, so the dashboard doesn't need to be perfect at deployment time, just comprehensible enough that we can monitor the deployment and make sure this change doesn't somehow negatively affect the response times (which, to be clear, I don't think will be the case, just want to make sure we can move forward with confidence!).

sarayourfriend · 2023-11-08T06:09:34Z

Update on the dashboard: I implemented it earlier today, and it's available at a link in the issue: https://github.com/WordPress/openverse-infrastructure/issues/651

With unit tests this will be good-to-go from my perspective 😁

obulat · 2023-11-08T13:15:26Z

Thank you so much for setting up the dashboard, @sarayourfriend!

I've updated this PR so that the first commit is adding the test for related_media: b312c5e
The following commits add the changes, and the last commit updates the related test.

I've added assertions for the ES query to check that the correct query is set. It is not very consistent with other tests, because instead of using assert, it uses the .json in pook request setup.

I wanted to add a check that the results' title and tags have common words, but realized that it's not possible with this implementation since we use mocked results instead of the "related" results. I will see if I can use the integrated tests for this.

sarayourfriend · 2023-11-08T17:56:16Z

> I've added assertions for the ES query to check that the correct query is set. It is not very consistent with other tests, because instead of using assert, it uses the .json in pook request setup.

For what it's worth, I think that's fine, there are other ES tests that do this:

https://github.com//WordPress/openverse/blob/95b59110eff953f7d30d9b9444791ae2e0dc6d68/api/test/unit/controllers/test_search_controller.py#L584-L591

> I wanted to add a check that the results' title and tags have common words, but realized that it's not possible with this implementation since we use mocked results instead of the "related" results. I will see if I can use the integrated tests for this.

You're right, I think it's only at the integration test level that we can actually test this in a meaningful way.

obulat · 2023-11-08T18:03:32Z

> For what it's worth, I think that's fine, there are other ES tests that do this:

I think this example is a little bit different from what I did in the test here. In the linked example, there's only 1 .json, which is used to set the response. In this PR, however, .json is used twice: first for making sure that the request body matches the shape we expect, and second to mock the response:

    mock_related = (
        pook.post(es_filtered_index_endpoint)
 >      .json(es_related_query)  # Testing that ES query is correct 
        .times(1)
        .reply(200)
        .header("x-elastic-product", "Elasticsearch")
 >      .json(mock_es_response)
        .mock
    )

sarayourfriend · 2023-11-08T18:11:19Z

Sure, but the example I showed just uses a regex to test the body of the request. There's no real difference, the end result is pook either compares the body as strings using == (.json just tells pook to serialise the argument to a JSON string, it then compares the bodies as strings), or executes a regex test. In other words, the tests are really the same, the example I shared just tests a part of the body, whereas passing .json on the pook request tests the whole body.

Anyway, it's all good, both are perfectly fine approaches as far as I'm concerned and well suited for different testing needs.
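
To make the difference between the two approaches concrete, here is a minimal standard-library sketch. It does not call pook; it only mimics the two strategies as described above (whole-body string comparison versus a regex over part of the body). The function names and the request body are made up for illustration.

```python
import json
import re


def matches_whole_body(request_body: str, expected: dict) -> bool:
    """Serialise the expected dict and compare it with the body as a string."""
    return json.dumps(expected, sort_keys=True) == request_body


def matches_part_of_body(request_body: str, pattern: str) -> bool:
    """Check only that some part of the body matches the pattern."""
    return re.search(pattern, request_body) is not None


# Hypothetical request body, serialised the same way as the expected value.
request_body = json.dumps(
    {"query": {"terms": {"tags.name": ["cat"]}}, "size": 10}, sort_keys=True
)

whole = matches_whole_body(
    request_body, {"size": 10, "query": {"terms": {"tags.name": ["cat"]}}}
)
partial = matches_part_of_body(request_body, r'"tags\.name": \["cat"\]')
```

The whole-body check fails as soon as any key differs, while the regex check passes as long as the fragment of interest is present.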

obulat · 2023-11-08T18:36:44Z

I added a check that the related results have at least one word in common with the main item (in the title and tags) in 8a5e3e4
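
A check of this kind could look roughly like the sketch below. It is not the test added in 8a5e3e4: the helper names and sample items are hypothetical, and the item shape (a `title` plus a list of tags with a `name` key) is assumed from the snippets in this PR.

```python
import re


def words(item):
    """Lowercased words from an item's title and tag names."""
    tag_names = [tag["name"] for tag in item.get("tags") or []]
    text = " ".join([item.get("title") or "", *tag_names])
    return set(re.findall(r"\w+", text.lower()))


def shares_a_word(main_item, related_item):
    """True if the two items have at least one word in common."""
    return bool(words(main_item) & words(related_item))


# Made-up sample data for illustration.
main_item = {"title": "Cat in the garden", "tags": [{"name": "cat"}, {"name": "pets"}]}
related = [
    {"title": "Sleeping cat", "tags": [{"name": "animal"}]},
    {"title": "Roses", "tags": [{"name": "garden"}]},
]

all_related = all(shares_a_word(main_item, item) for item in related)
```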

sarayourfriend

LGTM! Excited to see how/if the endpoint timings respond 🤞

AetherUnbound

Thanks for adding the new tests, I'm excited to try this out!

sarayourfriend · 2023-11-13T07:55:41Z

api/api/controllers/elasticsearch/related.py

+        # Only use the first 10 tags
+        if tags:
+            tags = [tag["name"] for tag in tags[:10]]
+            related_query["should"].append(Q("terms", tags__name=tags))


Note for follow up @obulat:

Suggested change

-            related_query["should"].append(Q("terms", tags__name=tags))
+            related_query["should"].append(Q("terms", tags__name__keyword=tags))
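
For readers unfamiliar with the keyword-argument style: elasticsearch-dsl expands double underscores in keyword arguments to dots, so the suggestion changes the queried field from `tags.name` to `tags.name.keyword`. The sketch below mimics that expansion with plain dictionaries rather than calling the library; `terms_body` and the sample tag list are hypothetical.

```python
def terms_body(field_kwarg, values):
    """Mimic how a double-underscore keyword becomes a dotted field name."""
    return {"terms": {field_kwarg.replace("__", "."): values}}


# Made-up tag list in the shape used by the snippet above.
tags = [{"name": f"tag{i}"} for i in range(12)]
names = [tag["name"] for tag in tags[:10]]  # only the first 10 tags

before = terms_body("tags__name", names)
after = terms_body("tags__name__keyword", names)
```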

sarayourfriend · 2023-11-13T23:22:15Z

Just wanted to share an update. We deployed this an hour ago and the related image response times are dramatically improved.

Nice work @obulat!

obulat requested a review from a team as a code owner November 2, 2023 09:00

obulat requested review from sarayourfriend and stacimc November 2, 2023 09:00

github-actions bot added the 🧱 stack: api Related to the Django API label Nov 2, 2023

openverse-bot added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository labels Nov 2, 2023

obulat force-pushed the simplify_related_query branch 2 times, most recently from b6ed3e7 to a5bb1a5 Compare November 3, 2023 04:12

obulat changed the title ~~Simplify related query to remove nesting~~ Simplify related query to remove nesting and make more performant Nov 6, 2023

obulat force-pushed the simplify_related_query branch 2 times, most recently from f88b556 to bc8934c Compare November 6, 2023 07:44

obulat changed the base branch from main to fix/timing-of-es-queries November 6, 2023 07:44

Base automatically changed from fix/timing-of-es-queries to main November 7, 2023 15:21

obulat force-pushed the simplify_related_query branch from bc8934c to b1e2e52 Compare November 7, 2023 15:24

sarayourfriend reviewed Nov 8, 2023

api/test/unit/controllers/test_search_controller.py (outdated)

sarayourfriend requested changes Nov 8, 2023

AetherUnbound reviewed Nov 8, 2023

Add related test

b312c5e

Signed-off-by: Olga Bulat <[email protected]>

obulat force-pushed the simplify_related_query branch from c5394b2 to 91ae7fe Compare November 8, 2023 13:09

Simplify related query to remove nesting

cb8ed3c

Signed-off-by: Olga Bulat <[email protected]>

obulat force-pushed the simplify_related_query branch from 91ae7fe to 6d91d59 Compare November 8, 2023 13:24

obulat added 2 commits November 8, 2023 16:28

Use terms query for tags in related

f382ab5

Signed-off-by: Olga Bulat <[email protected]>

Update the unit test

568f754

Signed-off-by: Olga Bulat <[email protected]>

obulat force-pushed the simplify_related_query branch from 6d91d59 to 568f754 Compare November 8, 2023 13:28

obulat requested a review from sarayourfriend November 8, 2023 13:41

obulat requested a review from AetherUnbound November 8, 2023 13:41

Add excluded providers cache

9e65829

Signed-off-by: Olga Bulat <[email protected]>

obulat added 2 commits November 8, 2023 21:21

Test number of related results in integration

3bc1079

Signed-off-by: Olga Bulat <[email protected]>

Test that the results are related in integration

8a5e3e4

Signed-off-by: Olga Bulat <[email protected]>

obulat closed this Nov 8, 2023

obulat reopened this Nov 8, 2023

sarayourfriend approved these changes Nov 8, 2023

AetherUnbound approved these changes Nov 8, 2023

obulat merged commit 7bb4298 into main Nov 9, 2023
58 of 80 checks passed

obulat deleted the simplify_related_query branch November 9, 2023 02:16

sarayourfriend reviewed Nov 13, 2023

obulat mentioned this pull request Nov 13, 2023

Use the keyword field for tags in related query #3346

Merged

8 tasks
