-
Notifications
You must be signed in to change notification settings - Fork 211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify search query #3261
Simplify search query #3261
Conversation
245eed6
to
5222794
Compare
f3277a0
to
464a79b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only request for change is to also update the search algorithm documentation along with these changes: https://docs.openverse.org/api/reference/search_algorithm.html. In particular, using filter
instead of should
is an important note for the metadata filters. Recording an explanation of why that is the correct choice for each of the fields (or at least a general explanation that is clearly application to each one) would be helpful.
One note about the tests: they're quite specific right now, in that they test the specific functions, and I wonder if they would be less fiddly in the future if we make adjustments to how the search query is built by doing a more "integration" style test by checking the actual query sent to Elasticsearch. This is done elsewhere using pook
to intercept the Elasticsearch client call. Not a request for change here because I'm not confident it's correct, but wanted to suggest it in case you had thoughts about the particular testing approach. I could also see the integration style test being more fiddly because it relies on the ES client's implementation of how it sends the query... on the other hand, that's an important thing we should know if it changes. There are trade-offs, in any case.
pages, the number of results, and the ``SearchContext`` as a dict. | ||
Create a list of Elasticsearch queries for filtering search results. | ||
The filter values are given in the request query string. | ||
We use ES filters (`filter`, `must_not`) because we don't need to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty interesting, and I hadn't really considered it deeply before. It would be great to have an updated section of the "search algorithm" documentation that explains why filter
is correct for the various fields. I was sceptical when I first read this, but thinking through each of the fields relevant to this function, it makes perfect sense to me now.
if '"' in query: | ||
base_query_kwargs["quote_field_suffix"] = ".exact" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noticing this, but we don't have any .exact
subfields. If we switch this to .raw
, though, I believe it would start to work, because title, description and tags.name do have .raw
subfields that are not analysed (and therefore not stemmed, so should theoretically be able to service exact match queries).
Separate issue though!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's interesting! I didn't look into what .exact
does, just copied the existing code :) I'll open a new issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found the PR that added the .exact
filter: Make quoted queries behave as described in the API documentation (return exact matches only)
I'll open an issue for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"simple_query_string", | ||
**base_query_kwargs, | ||
) | ||
search_queries["must"].append(Q("simple_query_string", **base_query_kwargs)) | ||
# Boost exact matches on the title | ||
quotes_stripped = query.replace('"', "") | ||
exact_match_boost = Q( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would also theoretically change with the .raw
subfields.
5b86981
to
e95364e
Compare
Full-stack documentation: https://docs.openverse.org/_preview/3261 Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub pages deployment action list to see the current status of the deployments. Changed files 🔄: |
Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR: @sarayourfriend Excluding weekend1 days, this PR was ready for review 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)2. @obulat, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is excellent! No blocking comments, just some notes and thoughts 🙂 it's very cool to see this so improved!
search_queries["should"].extend(create_ranking_queries(search_params)) | ||
|
||
# If there are no `must` query clauses, only the results that match | ||
# the `should` clause are returned. To avoid this, we add an empty | ||
# query clause to the `must` list. | ||
if not search_queries["must"]: | ||
search_queries["must"].append(EMPTY_QUERY) | ||
|
||
return Q( | ||
"bool", | ||
filter=search_queries["filter"], | ||
must_not=search_queries["must_not"], | ||
must=search_queries["must"], | ||
should=search_queries["should"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to simplifying the query itself, this is all a lot clearer to read too!
from api.controllers import search_controller | ||
|
||
|
||
pytestmark = pytest.mark.django_db |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: pytestmark
is a little ambiguous, maybe django_db_mark
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pytestmark
is a pytest feature: https://docs.pytest.org/en/7.4.x/reference/reference.html#globalvar-pytestmark
It has to have this name or it won't have any effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OH, TIL! That's super cool 😮
|
||
def test_create_search_query_empty(media_type_config): | ||
serializer = media_type_config.search_request_serializer(data={}) | ||
serializer.is_valid() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For all of these is_valid
calls, should we have raise_exception
set to True
?
# this is a deprecated param, and it doesn't work because it doesn't exist in the serializer | ||
"categories": "digitized_artwork", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we including it here just to showcase that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I included it in the PR to showcase this to the reviewers, but I should probably remove it from the test that will be merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Co-authored-by: sarayourfriend <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
e95364e
to
9457da9
Compare
Thank you for the suggestion, @sarayourfriend. I will leave the tests as they are here, but checking the ES queries seems valuable. Maybe we should use both? |
Fixes
Fixes #3243 by @obulat
Description
Instead of using
elasticsearch_dsl
for creating the search query, this PR uses a simple dictionary for the 4 query clauses:filter
,must_not
,must
andshould
. This dictionary is then used to create a much cleaner query which is almost identical in function to the existing query.The biggest difference is that the url parameter queries (license, aspect ratio, width, extension, etc.) are actually filters. They remove all of the non-matching items from the results, instead of simply assigning them a lower score. Not calculating the score should also in theory make the filters faster.
Query examples
To make these queries similar to the production queries, I checked "Hide content" in the Flickr ContentProvider object form in Django admin. This should dynamically add this provider to
exclude
query.No parameters
http://localhost:50280/v1/images/?format=json&page_size=1
Updated
Old
q
search with filtershttp://localhost:50280/v1/images/?q=cat&format=json&license=by,cc0&aspect_ratio=wide&page_size=1
Updated
Old
title
search (withoutq
) withexcluded_source
paramhttp://localhost:50280/v1/images/?title=cat&format=json&excluded_source=stocksnap&page_size=1
New
Old
Unit tests
I added unit tests for the newly extracted
search_controller.create_search_query
method. I wish it was easy to parametrize these tests, but because we need to check the whole query for each test, it's not easy to do.While writing the tests, I realized that the
categories
filter doesn't work anymore. In thesearch_controller
, we still try to handle it converting the deprecatedcategories
param to the currentcategory
parameter. However, the serializer does not pass the deprecatedcategories
to the controller! So, I just removed the deprecatedcategories
param from the controller.Another thing I learned when writing the tests is that
license
andlicense_type
filters don't interact at all, so they create 2 completely separate terms filters. We should probably combine them, taking an intersection of the two license lists as the ES query filter (e.g., iflicense=by-nc&license_type=commercial
would not return any results withby-nc
because the commercial filter would exclude them).Testing Instructions
Set
DEBUG_SCORES=True
to theapi/.env
file so that the Elasticsearch queries a logged in theweb
Docker container logs.Run the API using
just up
.Go to localhost:50280/admin (use deploy/deploy to log in), and check "Hide content" for one of the providers so that one of the providers is excluded dynamically in the queries (to match the current prod settings).
Try the queries from the PR description and look at the queries logged in Docker logs.
Also compare the queries to the ones on the
main
branch and see that they are different.Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin