Simplify search query #3261

Merged
merged 6 commits into from
Nov 1, 2023

Conversation

Contributor

@obulat obulat commented Oct 26, 2023

Fixes

Fixes #3243 by @obulat

Description

Instead of using elasticsearch_dsl for creating the search query, this PR uses a simple dictionary for the 4 query clauses: filter, must_not, must and should. This dictionary is then used to create a much cleaner query which is almost identical in function to the existing query.

The biggest difference is that the url parameter queries (license, aspect ratio, width, extension, etc.) are actually filters. They remove all of the non-matching items from the results, instead of simply assigning them a lower score. Not calculating the score should also in theory make the filters faster.
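A minimal sketch of the approach described above, using plain dictionaries rather than the project's actual helper. The function name `build_bool_query` and the sample clause values are assumptions made for illustration; only the four clause names and the `match_all` fallback come from this PR.

```python
from typing import Any, Dict, List

# Matches every document; used when no `must` clause was requested.
EMPTY_QUERY = {"match_all": {}}


def build_bool_query(clauses: Dict[str, List[Dict[str, Any]]]) -> Dict[str, Any]:
    """Assemble a flat bool query from per-clause lists (hypothetical helper)."""
    must = list(clauses.get("must", []))
    if not must:
        # Without a `must` clause, fall back to matching everything.
        must.append(EMPTY_QUERY)
    body: Dict[str, Any] = {"must": must}
    # `filter` and `must_not` only include or exclude documents, they do not
    # contribute to the score; `should` only affects ranking here.
    for name in ("filter", "must_not", "should"):
        if clauses.get(name):
            body[name] = list(clauses[name])
    return {"bool": body}


clauses = {
    "filter": [{"terms": {"license": ["by", "cc0"]}}],
    "must_not": [{"term": {"mature": True}}],
    "must": [],
    "should": [
        {"rank_feature": {"field": "standardized_popularity", "boost": 10000}}
    ],
}
query = build_bool_query(clauses)
```

Keeping the four lists separate until the last step is what removes the nested `bool` wrappers visible in the old queries.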

Query examples

To make these queries similar to the production queries, I checked "Hide content" in the Flickr ContentProvider object form in Django admin. This should dynamically add this provider to the exclusion query.

No parameters

http://localhost:50280/v1/images/?format=json&page_size=1
Updated

{'bool': {
    'must': [{'match_all': {}}],
    'must_not': [{'term': {'mature': True}}, {'terms': {'provider': ['flickr']}}],
    'should': [{'rank_feature': {'boost': 10000, 'field': 'standardized_popularity'}}]
}}

Old

{'bool': {
    'must': [{'bool': {
        'filter': [
            {'bool': {'must_not': [{'term': {'mature': True}}]}},
            {'bool': {'must_not': [{'terms': {'provider': ['flickr']}}]}}
        ]
    }}],
    'should': [{'rank_feature': {'boost': 10000, 'field': 'standardized_popularity'}}]
}}

q search with filters

http://localhost:50280/v1/images/?q=cat&format=json&license=by,cc0&aspect_ratio=wide&page_size=1
Updated

{'bool': {
    'filter': [{'terms': {'license': ['by', 'cc0']}}, {'terms': {'aspect_ratio': ['wide']}}],
    'must': [{'simple_query_string': {'default_operator': 'AND', 'fields': ['title', 'description', 'tags.name'], 'query': 'cat'}}],
    'must_not': [{'term': {'mature': True}}, {'terms': {'provider': ['flickr']}}],
    'should': [{'simple_query_string': {'boost': 10000, 'fields': ['title'], 'query': 'cat'}}, {'rank_feature': {'boost': 10000, 'field': 'standardized_popularity'}}]
}}

Old

{'bool': {
    'must': [{'bool': {
        'must': [{'bool': {
            'filter': [
                {'bool': {'should': [{'terms': {'license': ['by', 'cc0']}}]}},
                {'bool': {'should': [{'terms': {'aspect_ratio': ['wide']}}]}},
                {'bool': {'must_not': [{'term': {'mature': True}}]}},
                {'bool': {'must_not': [{'terms': {'provider': ['flickr']}}]}}
            ],
            'must': [{'simple_query_string': {'default_operator': 'AND', 'fields': ['tags.name', 'title', 'description'], 'query': 'cat'}}]
        }}],
        'should': [{'simple_query_string': {'boost': 10000, 'fields': ['title'], 'query': 'cat'}}]
    }}],
    'should': [{'rank_feature': {'boost': 10000, 'field': 'standardized_popularity'}}]
}}

title search (without q) with excluded_source param

http://localhost:50280/v1/images/?title=cat&format=json&excluded_source=stocksnap&page_size=1

New

{'query': {'bool': {
    'must': [{'simple_query_string': {'fields': ['title'], 'query': 'cat'}}],
    'must_not': [{'terms': {'source': ['stocksnap']}},{'terms': {'provider': ['flickr']}}, {'term': {'mature': True}}],
     'should': [{'rank_feature': {'boost': 10000, 'field': 'standardized_popularity'}}]}}
}

Old

{'query': {'bool': {
    'must': [{'bool': {
        'filter': [
            {'bool': {'must_not': [{'terms': {'source': ['stocksnap']}}]}},
            {'bool': {'must_not': [{'term': {'mature': True}}]}},
            {'bool': {'must_not': [{'terms': {'provider': ['flickr']}}]}}
        ],
        'must': [{'simple_query_string': {'fields': ['title'], 'query': 'cat'}}]
    }}],
    'should': [{'rank_feature': {'boost': 10000, 'field': 'standardized_popularity'}}]
}}}

Unit tests

I added unit tests for the newly extracted search_controller.create_search_query method. I wish it was easy to parametrize these tests, but because we need to check the whole query for each test, it's not easy to do.

While writing the tests, I realized that the categories filter doesn't work anymore. In the search_controller, we still try to handle it by converting the deprecated categories param to the current category parameter. However, the serializer does not pass the deprecated categories to the controller! So, I just removed the deprecated categories param from the controller.

Another thing I learned when writing the tests is that the license and license_type filters don't interact at all, so they create 2 completely separate terms filters. We should probably combine them, taking an intersection of the two license lists as the ES query filter (e.g., license=by-nc&license_type=commercial would not return any results with by-nc because the commercial filter would exclude them).
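A rough sketch of what combining the two filters by intersection could look like. The `LICENSE_GROUPS` mapping and the `combined_license_filter` name are hypothetical and only illustrative; the real mapping of license types to licenses lives in the API and may differ.

```python
from typing import Any, Dict, List, Optional

# Hypothetical, abbreviated mapping from a license_type to its licenses.
LICENSE_GROUPS = {
    "commercial": {"by", "by-sa", "by-nd", "cc0", "pdm"},
    "modification": {"by", "by-sa", "by-nc", "by-nc-sa", "cc0", "pdm"},
}


def combined_license_filter(
    licenses: Optional[List[str]], license_type: Optional[str]
) -> Optional[Dict[str, Any]]:
    """Return one terms filter for both parameters, or None if neither is set."""
    allowed = set(licenses) if licenses else None
    if license_type is not None:
        group = LICENSE_GROUPS[license_type]
        # Intersect only when both parameters were supplied.
        allowed = set(group) if allowed is None else allowed & group
    if allowed is None:
        return None
    # Sorted so the generated query is deterministic and easy to assert on.
    return {"terms": {"license": sorted(allowed)}}


both = combined_license_filter(["by", "by-nc"], "commercial")
conflicting = combined_license_filter(["by-nc"], "commercial")
```

In this sketch a conflicting combination produces an empty license list; how the controller should treat that case is left open.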

Testing Instructions

Add DEBUG_SCORES=True to the api/.env file so that the Elasticsearch queries are logged in the web Docker container logs.
Run the API using just up.
Go to localhost:50280/admin (use deploy/deploy to log in), and check "Hide content" for one of the providers so that one of the providers is excluded dynamically in the queries (to match the current prod settings).
Try the queries from the PR description and look at the queries logged in Docker logs.
Also compare the queries to the ones on the main branch and see that they are different.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@github-actions github-actions bot added the 🧱 stack: api Related to the Django API label Oct 26, 2023
@obulat obulat force-pushed the simplify-search-query branch from 245eed6 to 5222794 Compare October 26, 2023 15:24
@openverse-bot openverse-bot added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository labels Oct 26, 2023
@obulat obulat force-pushed the simplify-search-query branch 3 times, most recently from f3277a0 to 464a79b Compare October 27, 2023 10:41
@obulat obulat marked this pull request as ready for review October 27, 2023 14:38
@obulat obulat requested a review from a team as a code owner October 27, 2023 14:38
Collaborator

@sarayourfriend sarayourfriend left a comment

My only request for change is to also update the search algorithm documentation along with these changes: https://docs.openverse.org/api/reference/search_algorithm.html. In particular, using filter instead of should is an important note for the metadata filters. Recording an explanation of why that is the correct choice for each of the fields (or at least a general explanation that is clearly applicable to each one) would be helpful.

One note about the tests: they're quite specific right now, in that they test the specific functions, and I wonder if they would be less fiddly in the future if we make adjustments to how the search query is built by doing a more "integration" style test by checking the actual query sent to Elasticsearch. This is done elsewhere using pook to intercept the Elasticsearch client call. Not a request for change here because I'm not confident it's correct, but wanted to suggest it in case you had thoughts about the particular testing approach. I could also see the integration style test being more fiddly because it relies on the ES client's implementation of how it sends the query... on the other hand, that's an important thing we should know if it changes. There are trade-offs, in any case.

pages, the number of results, and the ``SearchContext`` as a dict.
Create a list of Elasticsearch queries for filtering search results.
The filter values are given in the request query string.
We use ES filters (`filter`, `must_not`) because we don't need to
Collaborator

This is pretty interesting, and I hadn't really considered it deeply before. It would be great to have an updated section of the "search algorithm" documentation that explains why filter is correct for the various fields. I was sceptical when I first read this, but thinking through each of the fields relevant to this function, it makes perfect sense to me now.

Comment on lines 351 to 354
if '"' in query:
    base_query_kwargs["quote_field_suffix"] = ".exact"
Collaborator

Just noticing this, but we don't have any .exact subfields. If we switch this to .raw, though, I believe it would start to work, because title, description and tags.name do have .raw subfields that are not analysed (and therefore not stemmed, so should theoretically be able to service exact match queries).

Separate issue though!
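For illustration, a small sketch of how the suffix could be made configurable when building the `simple_query_string` clause. The helper name `build_text_query` is hypothetical, and the `.raw` default assumes the subfields mentioned in the comment above exist in the index mapping.

```python
from typing import Any, Dict, List


def build_text_query(
    query: str, fields: List[str], quote_suffix: str = ".raw"
) -> Dict[str, Any]:
    """Build a simple_query_string clause (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "query": query,
        "fields": fields,
        "default_operator": "AND",
    }
    if '"' in query:
        # Quoted phrases are matched against the suffixed subfields,
        # e.g. `title.raw`, instead of the analysed base fields.
        kwargs["quote_field_suffix"] = quote_suffix
    return {"simple_query_string": kwargs}


fields = ["title", "description", "tags.name"]
quoted = build_text_query('"black cat"', fields)
unquoted = build_text_query("black cat", fields)
```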

Contributor Author

That's interesting! I didn't look into what .exact does, just copied the existing code :) I'll open a new issue.

Contributor Author

Found the PR that added the .exact filter: Make quoted queries behave as described in the API documentation (return exact matches only)

I'll open an issue for this.

Contributor Author

@obulat obulat Oct 30, 2023

"simple_query_string",
**base_query_kwargs,
)
search_queries["must"].append(Q("simple_query_string", **base_query_kwargs))
# Boost exact matches on the title
quotes_stripped = query.replace('"', "")
exact_match_boost = Q(
Collaborator

This would also theoretically change with the .raw subfields.

@obulat obulat force-pushed the simplify-search-query branch from 5b86981 to e95364e Compare October 30, 2023 14:25
@obulat obulat requested a review from a team as a code owner October 30, 2023 14:25
@github-actions github-actions bot added the 🧱 stack: documentation Related to Sphinx documentation label Oct 30, 2023
@github-actions

Full-stack documentation: https://docs.openverse.org/_preview/3261

Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

Changed files 🔄:

@openverse-bot
Collaborator

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@sarayourfriend
@stacimc
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend[1] days, this PR was ready for review 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)[2].

@obulat, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

Collaborator

@AetherUnbound AetherUnbound left a comment

This is excellent! No blocking comments, just some notes and thoughts 🙂 it's very cool to see this so improved!

Comment on lines +382 to +395
search_queries["should"].extend(create_ranking_queries(search_params))

# If there are no `must` query clauses, only the results that match
# the `should` clause are returned. To avoid this, we add an empty
# query clause to the `must` list.
if not search_queries["must"]:
    search_queries["must"].append(EMPTY_QUERY)

return Q(
    "bool",
    filter=search_queries["filter"],
    must_not=search_queries["must_not"],
    must=search_queries["must"],
    should=search_queries["should"],
Collaborator

In addition to simplifying the query itself, this is all a lot clearer to read too!

from api.controllers import search_controller


pytestmark = pytest.mark.django_db
Collaborator

Nit: pytestmark is a little ambiguous, maybe django_db_mark?

Collaborator

pytestmark is a pytest feature: https://docs.pytest.org/en/7.4.x/reference/reference.html#globalvar-pytestmark

It has to have this name or it won't have any effect.

Collaborator

OH, TIL! That's super cool 😮


def test_create_search_query_empty(media_type_config):
    serializer = media_type_config.search_request_serializer(data={})
    serializer.is_valid()
Collaborator

For all of these is_valid calls, should we have raise_exception set to True?

Comment on lines +108 to +109
# this is a deprecated param, and it doesn't work because it doesn't exist in the serializer
"categories": "digitized_artwork",
Collaborator

Are we including it here just to showcase that?

Contributor Author

I included it in the PR to showcase this to the reviewers, but I should probably remove it from the test that will be merged.

Collaborator

@sarayourfriend sarayourfriend left a comment

LGTM!

@obulat obulat force-pushed the simplify-search-query branch from e95364e to 9457da9 Compare November 1, 2023 11:06
@obulat
Contributor Author

obulat commented Nov 1, 2023

> One note about the tests: they're quite specific right now, in that they test the specific functions, and I wonder if they would be less fiddly in the future if we make adjustments to how the search query is built by doing a more "integration" style test by checking the actual query sent to Elasticsearch. This is done elsewhere using pook to intercept the Elasticsearch client call. Not a request for change here because I'm not confident it's correct, but wanted to suggest it in case you had thoughts about the particular testing approach. I could also see the integration style test being more fiddly because it relies on the ES client's implementation of how it sends the query... on the other hand, that's an important thing we should know if it changes. There are trade-offs, in any case.

Thank you for the suggestion, @sarayourfriend. I will leave the tests as they are here, but checking the ES queries seems valuable. Maybe we should use both?
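To make the "check the query that is actually sent" idea concrete, here is a sketch that swaps in the standard library's `unittest.mock` instead of pook, so it stays self-contained. The `run_search` function and the `"image"` index name are hypothetical stand-ins for the real code path, not the project's API.

```python
from typing import Any, Dict
from unittest import mock


def run_search(client: Any, index: str, query: Dict[str, Any]) -> Any:
    """Hypothetical stand-in for the code that sends the query to Elasticsearch."""
    return client.search(index=index, body={"query": query})


# A mock in place of the Elasticsearch client records every call made to it.
client = mock.MagicMock()
query = {
    "bool": {
        "must": [{"match_all": {}}],
        "must_not": [{"term": {"mature": True}}],
    }
}
run_search(client, "image", query)

# Inspect what would have gone over the wire.
client.search.assert_called_once()
sent_body = client.search.call_args.kwargs["body"]
```

A test written this way asserts on the request as a whole, so it keeps passing if the internal helper functions are reorganised, at the cost of depending on how the client call is shaped.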

@obulat obulat merged commit 535ded0 into main Nov 1, 2023
44 checks passed
@obulat obulat deleted the simplify-search-query branch November 1, 2023 11:24
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: api Related to the Django API 🧱 stack: documentation Related to Sphinx documentation
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Search controller uses should for filtering instead of filter
4 participants