-
Notifications
You must be signed in to change notification settings - Fork 50
Make quoted queries behave as described in the API documentation (return exact matches only) #1012
Conversation
API Developer Docs Preview: Ready https://wordpress.github.io/openverse-api/_preview/1012 Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub pages deployment action list to see the current status of the deployments. |
Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR: @krysal Excluding weekend1 days, this PR was updated 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)2. @sarayourfriend, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings. Footnotes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the testing instructions and finding some good sample queries!
Tests went well! For example, when testing with the bird perched
query, I received 45 results when not using quotes and 26 when quoted. The exact match when searching with quotes worked perfectly. I think the boost may be broken however -- see comment below.
As for the testing, I agree with everything you said. Although it'd be nice to test further, the best I can come up with is testing that the appropriate parameters were passed, which feels like testing Elasticsearch itself.
quotes_stripped = query.replace('"', "") | ||
exact_match_boost = Q( | ||
"simple_query_string", | ||
fields=["title"], | ||
query=f'"{quotes_stripped}"', | ||
fields=["title.exact"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might actually be breaking the exact match boost, oddly. When searching without quotes (http://localhost:50280/v1/images/?q=bird%20perched), I noticed that some results that did not have exact title match were included in the first page of results. (For example, there's a Black Bird Photo
and Bird Nature Photo
mixed in.)
I tried removing the .exact
added here and then the boost appeared to work again -- only photos titled Bird Perched Photo
appeared in the first page of results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting! Thank you for looking into and testing this. I'll make the change (and do a little reading to understand why that would be the case 🤔)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is very weird! I tried removing the .exact
totally on a whim to compare, definitely expected your syntax here to be the correct one 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Staci, I did a bit of reading in the ES documentation because I was confused why it didn't work the way we expected. I think it comes down to us using the bool
query in a particular way that I have to admit, I do not fully understand. I tried to re-write the boost (just out of curiosity, not for this PR) to use the field-boosting described in the simple query string DSL documentation, but I could not get the results to budge at all. I am pretty curious to read more about Elasticsearch to understand better how these things are meant to work and how score boosting is best approached. It'd be nice, in any case, to document the current approach, why it works and why (if it indeed is) it is the best and correct approach for our use case.
Anyway, I removed the .exact
on this one and things are back in working order. Thanks again for looking into this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really interesting exploration, Sara, and thanks for sharing your findings so far!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into it! Super strange.
I re-tested and everything looks good to me except for the test failure. Not sure what's going on there, as. when I test locally it's definitely the case that we get more unquoted results than quoted ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Your suggested queries do return different results as expected.
I took a look into why the tests are failing, through some debug printing it seems that both the quoted & unquoted searches are returning 20 results:
test/image_integration_test.py::test_search_quotes_exact len(quoted_results)=20
len(unquoted_results)=20
FAILED
I suspect this is because both queries actually return more than 20 but are being capped at the default page size (since both 45 and 26 are above that first page length). When I change the test to use the search term "fireworks celebration" instead (which returns 7 and 3 results unquoted and quoted respectively) the tests pass fine!
I hadn't had a chance to debug this yet today but this strikes me as an "oh gosh, of course"! Thank you for pointing it out. We could change the query or, more robustly, we could compare the |
Ooo, result count is a MUCH better thing to compare, good call! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
Fixes
Fixes #933 by @AetherUnbound
Description
Conditionally applies the
quote_field_suffix
query feature as suggested in the Elasticsearch documentation that @AetherUnbound shared in the linked issue: https://www.elastic.co/guide/en/elasticsearch/reference/current/mixing-exact-search-with-stemming.htmlWe can continue to "boost" exact matches against the title as we originally were, but I've added the
.exact
modifier there as well to ensure that the boost only applies to actual exact matches.The latter is hard to test locally, I haven't found any suitable queries for it in the test data and I do not know how to efficiently add usable test data.
The former, however, is pretty straightforward to test locally. Find a query that returns some results that you could narrow using an exact match. I found one for images and one for audio that are used in the integration tests:
Audio:
http://localhost:50280/v1/audio/?q=water%20running
vshttp://localhost:50280/v1/audio/?q=%22water%20running%22
Images:
http://localhost:50280/v1/images/?q=bird%20perched
vshttp://localhost:50280/v1/images/?q=%22bird%20perched%22
You can try these locally to see how the results differ and that the way they differ makes sense. The integration test only confirms that they do indeed differ. I'm not sure how to reasonably test that the exact match is "working" in a way that doesn't test implementation details. We could intercept the query to ES and see that
quote_field_suffix
is applied. Alternatively we could test that the exact match is present in the match fields of the response... but both feel like either testing Elasticsearch (which we do not have to do) or testing an abstract implementation detail that doesn't approximate the actual expectation. Any advice here or creative solutions would be appreciated.Finally, I also fixed the incorrectly quoted example queries for images and audio exact match queries.
Testing Instructions
Check out the testing instructions in the previous section. If you can find any other queries that demonstrate this working with the test data, please share them here.
Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin