Simplify search query #3261

obulat · 2023-10-26T15:23:20Z

Fixes

Description

Instead of using elasticsearch_dsl for creating the search query, this PR uses a simple dictionary for the 4 query clauses: filter, must_not, must and should. This dictionary is then used to create a much cleaner query which is almost identical in function to the existing query.

The biggest difference is that the URL parameter queries (license, aspect ratio, width, extension, etc.) are now actual filters: they remove all non-matching items from the results instead of simply assigning them a lower score. Not calculating the score should also, in theory, make the filters faster.
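
As a rough illustration of the filter approach described above, here is a minimal sketch using plain dictionaries. The helper name `build_filter_clauses` and the `FILTER_PARAMS` tuple are assumptions made for this example; they are not the code in this PR.

```python
# Hypothetical sketch: turn comma-separated URL parameters into `terms` filters.
# FILTER_PARAMS is an assumed subset of the real filter parameters.
FILTER_PARAMS = ("license", "aspect_ratio", "extension")


def build_filter_clauses(params):
    """Return one `terms` clause for each filter parameter present in `params`."""
    clauses = []
    for name in FILTER_PARAMS:
        raw = params.get(name, "")
        # Split "by,cc0" into ["by", "cc0"], dropping empty pieces.
        values = [value.strip() for value in raw.split(",") if value.strip()]
        if values:
            clauses.append({"terms": {name: values}})
    return clauses


filters = build_filter_clauses({"license": "by,cc0", "aspect_ratio": "wide"})
```

Clauses built this way go into the `filter` list of the bool query, so a non-matching document is dropped rather than scored lower.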

Query examples

To make these queries similar to the production queries, I checked "Hide content" in the Flickr ContentProvider object form in Django admin. This should dynamically add this provider to the exclude query.
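
A sketch of how the exclusion clauses seen in the examples below could be assembled. `hidden_providers` stands in for whatever the "Hide content" setting yields, and both the function name and the `include_mature` flag are hypothetical.

```python
def build_must_not_clauses(hidden_providers, include_mature=False):
    """Build `must_not` clauses for mature content and hidden providers.

    Hypothetical helper: names and signature are assumptions for illustration.
    """
    clauses = []
    if not include_mature:
        clauses.append({"term": {"mature": True}})
    if hidden_providers:
        clauses.append({"terms": {"provider": list(hidden_providers)}})
    return clauses


must_not = build_must_not_clauses(["flickr"])
```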

No parameters

http://localhost:50280/v1/images/?format=json&page_size=1
Updated

{'bool': {
    'must': [{'match_all': {}}],
    'must_not': [{'term': {'mature': True}}, {'terms': {'provider': ['flickr']}}],
    'should': [{'rank_feature': {'boost': 10000, 'field': 'standardized_popularity'}}]
}}

Old

{'bool': {'must': [{'bool': {'filter': [{'bool': {'must_not': [{'term': {'mature': True}}]}},
                                                  {'bool': {'must_not': [{'terms': {'provider': ['flickr']}}]}}]}}],
                    'should': [{'rank_feature': {'boost': 10000,
                                                 'field': 'standardized_popularity'}}]}}

`q` search with filters

http://localhost:50280/v1/images/?q=cat&format=json&license=by,cc0&aspect_ratio=wide&page_size=1
Updated

{'bool': {
    'filter': [{'terms': {'license': ['by', 'cc0']}}, {'terms': {'aspect_ratio': ['wide']}}],
    'must': [{'simple_query_string': {'default_operator': 'AND', 'fields': ['title', 'description', 'tags.name'], 'query': 'cat'}}],
    'must_not': [{'term': {'mature': True}}, {'terms': {'provider': ['flickr']}}],
    'should': [{'simple_query_string': {'boost': 10000, 'fields': ['title'], 'query': 'cat'}}, {'rank_feature': {'boost': 10000, 'field': 'standardized_popularity'}}]
}}

Old

{'bool': {
    'must': [{'bool': {
        'must': [{'bool': {
            'filter': [
                {'bool': {'should': [{'terms': {'license': ['by', 'cc0']}}]}},
                {'bool': {'should': [{'terms': {'aspect_ratio': ['wide']}}]}},
                {'bool': {'must_not': [{'term': {'mature': True}}]}},
                {'bool': {'must_not': [{'terms': {'provider': ['flickr']}}]}}
            ],
            'must': [{'simple_query_string': {'default_operator': 'AND', 'fields': ['tags.name', 'title', 'description'], 'query': 'cat'}}]
        }}],
        'should': [{'simple_query_string': {'boost': 10000, 'fields': ['title'], 'query': 'cat'}}]
    }}],
    'should': [{'rank_feature': {'boost': 10000, 'field': 'standardized_popularity'}}]
}}

`title` search (without `q`) with `excluded_source` param

http://localhost:50280/v1/images/?title=cat&format=json&excluded_source=stocksnap&page_size=1

New

{'query': {'bool': {
    'must': [{'simple_query_string': {'fields': ['title'], 'query': 'cat'}}],
    'must_not': [{'terms': {'source': ['stocksnap']}}, {'terms': {'provider': ['flickr']}}, {'term': {'mature': True}}],
    'should': [{'rank_feature': {'boost': 10000, 'field': 'standardized_popularity'}}]
}}}

Old

{'query': {'bool': {
    'must': [{'bool': {
        'filter': [
            {'bool': {'must_not': [{'terms': {'source': ['stocksnap']}}]}},
            {'bool': {'must_not': [{'term': {'mature': True}}]}},
            {'bool': {'must_not': [{'terms': {'provider': ['flickr']}}]}}
        ],
        'must': [{'simple_query_string': {'fields': ['title'], 'query': 'cat'}}]
    }}],
    'should': [{'rank_feature': {'boost': 10000, 'field': 'standardized_popularity'}}]
}}}

Unit tests

I added unit tests for the newly extracted search_controller.create_search_query method. I wish it was easy to parametrize these tests, but because we need to check the whole query for each test, it's not easy to do.

While writing the tests, I realized that the categories filter doesn't work anymore. In the search_controller, we still try to handle it by converting the deprecated categories param to the current category parameter. However, the serializer does not pass the deprecated categories param to the controller! So, I just removed the deprecated categories param from the controller.

Another thing I learned when writing the tests is that the license and license_type filters don't interact at all, so they create 2 completely separate terms filters. We should probably combine them, taking an intersection of the two license lists as the ES query filter (e.g., license=by-nc&license_type=commercial would not return any results with by-nc because the commercial filter would exclude them).
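
The intersection idea could look roughly like the sketch below. The `LICENSE_GROUPS` mapping is an abbreviated assumption (the real license groups may differ), the function name is hypothetical, and only a single license_type value is handled.

```python
# Assumed, abbreviated mapping from license_type to the licenses it allows.
LICENSE_GROUPS = {
    "commercial": {"by", "by-sa", "by-nd", "cc0", "pdm"},
    "modification": {"by", "by-sa", "by-nc", "by-nc-sa", "cc0", "pdm"},
}


def combined_license_filter(licenses=None, license_type=None):
    """Return a single `terms` filter for both parameters, or None if neither is set."""
    allowed = None
    if licenses:
        allowed = {item.strip() for item in licenses.split(",") if item.strip()}
    if license_type:
        group = LICENSE_GROUPS[license_type]
        # Intersect with the explicit license list when both parameters are given.
        allowed = set(group) if allowed is None else allowed & group
    if allowed is None:
        return None
    return {"terms": {"license": sorted(allowed)}}


only_by = combined_license_filter("by,by-nc", "commercial")
nothing = combined_license_filter("by-nc", "commercial")
```

In the `by-nc` plus `commercial` case the intersection is empty, so the caller would need to decide how to handle an empty license list (for example, by short-circuiting to an empty result).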

Testing Instructions

Add DEBUG_SCORES=True to the api/.env file so that the Elasticsearch queries are logged in the web Docker container logs.
Run the API using just up.
Go to localhost:50280/admin (use deploy/deploy to log in), and check "Hide content" for one of the providers so that one of the providers is excluded dynamically in the queries (to match the current prod settings).
Try the queries from the PR description and look at the queries logged in Docker logs.
Also compare the queries to the ones on the main branch and see that they are different.

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main) or a parent feature branch.
My commit messages follow best practices.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.
I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

sarayourfriend

My only request for change is to also update the search algorithm documentation along with these changes: https://docs.openverse.org/api/reference/search_algorithm.html. In particular, using filter instead of should is an important note for the metadata filters. Recording an explanation of why that is the correct choice for each of the fields (or at least a general explanation that is clearly applicable to each one) would be helpful.

One note about the tests: they're quite specific right now, in that they test the specific functions, and I wonder if they would be less fiddly in the future if we make adjustments to how the search query is built by doing a more "integration" style test by checking the actual query sent to Elasticsearch. This is done elsewhere using pook to intercept the Elasticsearch client call. Not a request for change here because I'm not confident it's correct, but wanted to suggest it in case you had thoughts about the particular testing approach. I could also see the integration style test being more fiddly because it relies on the ES client's implementation of how it sends the query... on the other hand, that's an important thing we should know if it changes. There are trade-offs, in any case.
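
To make the "check the actual query sent to Elasticsearch" idea concrete, here is a minimal sketch. It uses `unittest.mock` from the standard library rather than pook, so it records the client call instead of intercepting the HTTP request, and `run_search` is a hypothetical stand-in for the controller code that calls the client.

```python
from unittest import mock


def run_search(es_client, query, index="image"):
    """Hypothetical stand-in for the code path that sends a query to Elasticsearch."""
    return es_client.search(index=index, body={"query": query})


# A mock client records how it was called instead of contacting a cluster.
fake_client = mock.Mock()
fake_client.search.return_value = {"hits": {"hits": []}}

query = {"bool": {"must": [{"match_all": {}}]}}
run_search(fake_client, query)

# The body that would have been sent to Elasticsearch.
sent_body = fake_client.search.call_args.kwargs["body"]
```

A test written this way asserts on `sent_body`, so it stays valid when the internal helper functions are reorganized, at the cost of depending on how the client is invoked.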

sarayourfriend · 2023-10-29T23:53:15Z

api/api/controllers/search_controller.py

-    pages, the number of results, and the ``SearchContext`` as a dict.
+    Create a list of Elasticsearch queries for filtering search results.
+    The filter values are given in the request query string.
+    We use ES filters (`filter`, `must_not`) because we don't need to


This is pretty interesting, and I hadn't really considered it deeply before. It would be great to have an updated section of the "search algorithm" documentation that explains why filter is correct for the various fields. I was sceptical when I first read this, but thinking through each of the fields relevant to this function, it makes perfect sense to me now.

sarayourfriend · 2023-10-30T00:22:18Z

api/api/controllers/search_controller.py

        if '"' in query:
            base_query_kwargs["quote_field_suffix"] = ".exact"


Just noticing this, but we don't have any .exact subfields. If we switch this to .raw, though, I believe it would start to work, because title, description and tags.name do have .raw subfields that are not analysed (and therefore not stemmed, so should theoretically be able to service exact match queries).

Separate issue though!

That's interesting! I didn't look into what .exact does, just copied the existing code :) I'll open a new issue.

Found the PR that added the .exact filter: Make quoted queries behave as described in the API documentation (return exact matches only)

I'll open an issue for this.

Issue: Correctly set up mixing of exact search with stemming
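
For reference, a sketch of where the suffix enters the query. `quote_field_suffix` is a real `simple_query_string` option; the helper name is hypothetical, and defaulting to `.raw` reflects the untested suggestion in this thread, not verified behaviour.

```python
def build_simple_query_string(query, quote_suffix=".raw"):
    """Build a `simple_query_string` clause, adding a suffix for quoted queries.

    Hypothetical helper: `.raw` is the reviewer's suggested subfield.
    """
    kwargs = {
        "query": query,
        "fields": ["title", "description", "tags.name"],
        "default_operator": "AND",
    }
    if '"' in query:
        # Quoted parts of the query are matched against `<field><suffix>`.
        kwargs["quote_field_suffix"] = quote_suffix
    return {"simple_query_string": kwargs}


quoted = build_simple_query_string('"black cat"')
unquoted = build_simple_query_string("cat")
```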

sarayourfriend · 2023-10-30T00:23:13Z

api/api/controllers/search_controller.py

-            "simple_query_string",
-            **base_query_kwargs,
-        )
+        search_queries["must"].append(Q("simple_query_string", **base_query_kwargs))
        # Boost exact matches on the title
        quotes_stripped = query.replace('"', "")
        exact_match_boost = Q(


This would also theoretically change with the .raw subfields.


github-actions · 2023-10-30T14:42:37Z

Full-stack documentation: https://docs.openverse.org/_preview/3261

Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

Changed files 🔄:

https://docs.openverse.org/_preview/3261/api/reference/search_algorithm.html

openverse-bot · 2023-11-01T00:00:12Z

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@sarayourfriend
@stacimc
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend¹ days, this PR was ready for review 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)².

@obulat, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

¹ Specifically, Saturday and Sunday.
² For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

AetherUnbound

This is excellent! No blocking comments, just some notes and thoughts 🙂 it's very cool to see this so improved!

AetherUnbound · 2023-11-01T01:13:42Z

api/api/controllers/search_controller.py

+        search_queries["should"].extend(create_ranking_queries(search_params))
+
+    # If there are no `must` query clauses, only the results that match
+    # the `should` clause are returned. To avoid this, we add an empty
+    # query clause to the `must` list.
+    if not search_queries["must"]:
+        search_queries["must"].append(EMPTY_QUERY)
+
+    return Q(
+        "bool",
+        filter=search_queries["filter"],
+        must_not=search_queries["must_not"],
+        must=search_queries["must"],
+        should=search_queries["should"],


In addition to simplifying the query itself, this is all a lot clearer to read too!
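
The assembly step in the diff above can be sketched with plain dictionaries as follows. The value of `EMPTY_QUERY` is inferred from the `match_all` clause in the "No parameters" example, and `assemble_bool_query` is a hypothetical name.

```python
# Assumed value, inferred from the "No parameters" example output.
EMPTY_QUERY = {"match_all": {}}


def assemble_bool_query(clauses):
    """Combine the four clause lists into a single bool query.

    With no `must` clause, only documents matching `should` would be
    returned, so a match-all clause is added as a fallback.
    """
    body = {"must": list(clauses.get("must", [])) or [EMPTY_QUERY]}
    for key in ("filter", "must_not", "should"):
        if clauses.get(key):
            body[key] = list(clauses[key])
    return {"bool": body}


result = assemble_bool_query(
    {
        "filter": [],
        "must": [],
        "must_not": [{"term": {"mature": True}}],
        "should": [{"rank_feature": {"boost": 10000, "field": "standardized_popularity"}}],
    }
)
```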

AetherUnbound · 2023-11-01T01:36:00Z

api/test/unit/controllers/test_search_controller_search_query.py

+from api.controllers import search_controller
+
+
+pytestmark = pytest.mark.django_db


Nit: pytestmark is a little ambiguous, maybe django_db_mark?

pytestmark is a pytest feature: https://docs.pytest.org/en/7.4.x/reference/reference.html#globalvar-pytestmark

It has to have this name or it won't have any effect.

OH, TIL! That's super cool 😮

AetherUnbound · 2023-11-01T01:42:34Z

api/test/unit/controllers/test_search_controller_search_query.py

+
+def test_create_search_query_empty(media_type_config):
+    serializer = media_type_config.search_request_serializer(data={})
+    serializer.is_valid()


For all of these is_valid calls, should we have raise_exception set to True?

AetherUnbound · 2023-11-01T02:15:46Z

api/test/unit/controllers/test_search_controller_search_query.py

+            # this is a deprecated param, and it doesn't work because it doesn't exist in the serializer
+            "categories": "digitized_artwork",


Are we including it here just to showcase that?

I included it in the PR to showcase this to the reviewers, but I should probably remove it from the test that will be merged.

sarayourfriend

LGTM!

Co-authored-by: sarayourfriend <[email protected]>

Signed-off-by: Olga Bulat <[email protected]>

obulat · 2023-11-01T11:23:35Z

One note about the tests: they're quite specific right now, in that they test the specific functions, and I wonder if they would be less fiddly in the future if we make adjustments to how the search query is built by doing a more "integration" style test by checking the actual query sent to Elasticsearch. This is done elsewhere using pook to intercept the Elasticsearch client call. Not a request for change here because I'm not confident it's correct, but wanted to suggest it in case you had thoughts about the particular testing approach. I could also see the integration style test being more fiddly because it relies on the ES client's implementation of how it sends the query... on the other hand, that's an important thing we should know if it changes. There are trade-offs, in any case.

Thank you for the suggestion, @sarayourfriend. I will leave the tests as they are here, but checking the ES queries seems valuable. Maybe we should use both?

github-actions bot added the 🧱 stack: api Related to the Django API label Oct 26, 2023

obulat force-pushed the simplify-search-query branch from 245eed6 to 5222794 Compare October 26, 2023 15:24

openverse-bot added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository labels Oct 26, 2023

obulat force-pushed the simplify-search-query branch 3 times, most recently from f3277a0 to 464a79b Compare October 27, 2023 10:41

obulat marked this pull request as ready for review October 27, 2023 14:38

obulat requested a review from a team as a code owner October 27, 2023 14:38

obulat requested review from sarayourfriend and stacimc October 27, 2023 14:38

sarayourfriend requested changes Oct 30, 2023

View reviewed changes

obulat force-pushed the simplify-search-query branch from 5b86981 to e95364e Compare October 30, 2023 14:25

obulat requested a review from a team as a code owner October 30, 2023 14:25

github-actions bot added the 🧱 stack: documentation Related to Sphinx documentation label Oct 30, 2023

obulat mentioned this pull request Oct 30, 2023

Correctly set up mixing of exact search with stemming #3269

Closed

obulat requested a review from sarayourfriend October 30, 2023 14:49

AetherUnbound approved these changes Nov 1, 2023

View reviewed changes

sarayourfriend approved these changes Nov 1, 2023

View reviewed changes

obulat and others added 6 commits November 1, 2023 14:03

Simplify search query

ff57abe

Add unit tests

bee281c

Update api/api/controllers/search_controller.py

dfa99f0

Co-authored-by: sarayourfriend <[email protected]>

Refactor excluded providers function

7c73d9c

Signed-off-by: Olga Bulat <[email protected]>

Add documentation about filters

325895c

Signed-off-by: Olga Bulat <[email protected]>

Raises exception in serializer

9457da9

Signed-off-by: Olga Bulat <[email protected]>

obulat force-pushed the simplify-search-query branch from e95364e to 9457da9 Compare November 1, 2023 11:06

obulat merged commit 535ded0 into main Nov 1, 2023
44 checks passed

obulat deleted the simplify-search-query branch November 1, 2023 11:24

obulat mentioned this pull request Nov 5, 2023

Excluded providers clause in the related query is inefficient #3306

Closed
