Make quoted queries behave as described in the API documentation (return exact matches only) #1012

sarayourfriend · 2022-11-21T04:55:06Z

Fixes

Description

Conditionally applies the quote_field_suffix query feature as suggested in the Elasticsearch documentation that @AetherUnbound shared in the linked issue: https://www.elastic.co/guide/en/elasticsearch/reference/current/mixing-exact-search-with-stemming.html
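As a rough illustration of the conditional behavior described above, here is a minimal sketch using plain dictionaries rather than the project's actual query-building code. The function name, the field list, and the `default_operator` value are assumptions for illustration only; `quote_field_suffix` itself is the `simple_query_string` parameter described in the linked Elasticsearch documentation, and the `.exact` suffix assumes the index mapping defines matching sub-fields.

```python
def build_text_query(q, fields=("title", "description", "tags.name")):
    """Build a simple_query_string clause (hypothetical helper).

    The field names and default_operator are illustrative assumptions.
    """
    body = {
        "query": q,
        "fields": list(fields),
        "default_operator": "AND",
    }
    # Only apply the suffix when the user actually quoted something, so that
    # unquoted queries keep using the stemmed fields.
    if '"' in q:
        body["quote_field_suffix"] = ".exact"
    return {"simple_query_string": body}


unquoted = build_text_query("bird perched")
quoted = build_text_query('"bird perched"')

print("quote_field_suffix" in unquoted["simple_query_string"])  # False
print(quoted["simple_query_string"]["quote_field_suffix"])  # .exact
```

With the suffix set, Elasticsearch matches quoted phrases against the suffixed sub-fields (for example `title.exact`) instead of the stemmed ones.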

We can continue to "boost" exact matches against the title as we originally were, but I've added the .exact modifier there as well to ensure that the boost only applies to actual exact matches.

The title boost is hard to test locally: I haven't found any suitable queries for it in the test data, and I do not know how to efficiently add usable test data.

The exact-match behavior for quoted queries, however, is pretty straightforward to test locally. Find a query that returns some results that you could narrow using an exact match. I found one for images and one for audio that are used in the integration tests:

Audio: http://localhost:50280/v1/audio/?q=water%20running vs http://localhost:50280/v1/audio/?q=%22water%20running%22

Images: http://localhost:50280/v1/images/?q=bird%20perched vs http://localhost:50280/v1/images/?q=%22bird%20perched%22

You can try these locally to see how the results differ and that the way they differ makes sense. The integration test only confirms that they do indeed differ. I'm not sure how to reasonably test that the exact match is "working" in a way that doesn't test implementation details. We could intercept the query to ES and see that quote_field_suffix is applied. Alternatively we could test that the exact match is present in the match fields of the response... but both feel like either testing Elasticsearch (which we do not have to do) or testing an abstract implementation detail that doesn't approximate the actual expectation. Any advice here or creative solutions would be appreciated.

Finally, I also fixed the incorrectly quoted example queries for images and audio exact match queries.
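For reference, the percent-encoded test URLs above can be generated with the standard library. This is only a convenience sketch: `search_url` is a hypothetical helper, and the base URL is the local development address used in the examples above.

```python
from urllib.parse import quote, urlencode

BASE = "http://localhost:50280/v1"


def search_url(media_type, q, exact=False):
    """Return a search URL, wrapping the term in double quotes for exact matching."""
    term = f'"{q}"' if exact else q
    # quote_via=quote encodes spaces as %20 (not "+") and double quotes as %22.
    params = urlencode({"q": term}, quote_via=quote)
    return f"{BASE}/{media_type}/?{params}"


print(search_url("audio", "water running"))
# http://localhost:50280/v1/audio/?q=water%20running
print(search_url("images", "bird perched", exact=True))
# http://localhost:50280/v1/images/?q=%22bird%20perched%22
```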

Testing Instructions

Check out the testing instructions in the previous section. If you can find any other queries that demonstrate this working with the test data, please share them here.

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main) or a parent feature branch.
My commit messages follow best practices.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

github-actions · 2022-11-21T04:56:12Z

API Developer Docs Preview: Ready

https://wordpress.github.io/openverse-api/_preview/1012

Please note that GitHub pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

openverse-bot · 2022-11-24T00:00:03Z

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@krysal
@stacimc
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend¹ days, this PR was updated 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)².

@sarayourfriend, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

¹ Specifically, Saturday and Sunday.
² For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that these reminders are generated at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

stacimc

Thank you for the testing instructions and finding some good sample queries!

Tests went well! For example, when testing with the bird perched query, I received 45 results when not using quotes and 26 when quoted. The exact match when searching with quotes worked perfectly. I think the boost may be broken however -- see comment below.

As for the testing, I agree with everything you said. Although it'd be nice to test further, the best I can come up with is testing that the appropriate parameters were passed, which feels like testing Elasticsearch itself.

stacimc · 2022-11-24T00:44:42Z

api/catalog/api/controllers/search_controller.py

        quotes_stripped = query.replace('"', "")
        exact_match_boost = Q(
            "simple_query_string",
-            fields=["title"],
-            query=f'"{quotes_stripped}"',
+            fields=["title.exact"],


I think this might actually be breaking the exact match boost, oddly. When searching without quotes (http://localhost:50280/v1/images/?q=bird%20perched), I noticed that some results that did not have exact title match were included in the first page of results. (For example, there's a Black Bird Photo and Bird Nature Photo mixed in.)

I tried removing the .exact added here and then the boost appeared to work again -- only photos titled Bird Perched Photo appeared in the first page of results.

sarayourfriend · 2022-11-24

Interesting! Thank you for looking into and testing this. I'll make the change (and do a little reading to understand why that would be the case 🤔)

stacimc · 2022-11-24

It is very weird! I tried removing the .exact totally on a whim to compare, definitely expected your syntax here to be the correct one 🤔

sarayourfriend · 2022-11-24

Staci, I did a bit of reading in the ES documentation because I was confused why it didn't work the way we expected. I think it comes down to us using the bool query in a particular way that I have to admit, I do not fully understand. I tried to re-write the boost (just out of curiosity, not for this PR) to use the field-boosting described in the simple query string DSL documentation, but I could not get the results to budge at all. I am pretty curious to read more about Elasticsearch to understand better how these things are meant to work and how score boosting is best approached. It'd be nice, in any case, to document the current approach, why it works and why (if it indeed is) it is the best and correct approach for our use case.

Anyway, I removed the .exact on this one and things are back in working order. Thanks again for looking into this!
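To make the shape of the boost under discussion easier to picture, here is a minimal sketch written as plain dictionaries so it runs without Elasticsearch or elasticsearch-dsl. It assumes the main query sits under `must` and the title boost under `should` of a `bool` query. That arrangement, the boost value, and the main query's field list are all assumptions for illustration; only the `title` field and the quote-stripping step come from the diff above.

```python
def build_boosted_query(q, boost=10000):
    """Combine a main text query with an exact-title boost (hypothetical helper).

    The bool/must/should layout and the boost value are illustrative assumptions.
    """
    main = {
        "simple_query_string": {
            "query": q,
            "fields": ["title", "description", "tags.name"],  # assumed fields
        }
    }
    # Strip any user-supplied quotes, then re-quote the whole term so the
    # boost clause only matches the full phrase in the title.
    quotes_stripped = q.replace('"', "")
    exact_match_boost = {
        "simple_query_string": {
            "fields": ["title"],
            "query": f'"{quotes_stripped}"',
            "boost": boost,
        }
    }
    # `should` clauses do not filter results here; they only raise the score
    # of documents that also match them.
    return {"bool": {"must": [main], "should": [exact_match_boost]}}


query = build_boosted_query("bird perched")
print(query["bool"]["should"][0]["simple_query_string"]["query"])  # "bird perched"
```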

zackkrida · 2022-11-28

Really interesting exploration, Sara, and thanks for sharing your findings so far!

stacimc · 2022-11-28

Thanks for looking into it! Super strange.

I re-tested and everything looks good to me except for the test failure. Not sure what's going on there, as when I test locally it's definitely the case that we get more unquoted results than quoted ones.

AetherUnbound

This looks great! Your suggested queries do return different results as expected.

I took a look into why the tests are failing, through some debug printing it seems that both the quoted & unquoted searches are returning 20 results:

test/image_integration_test.py::test_search_quotes_exact len(quoted_results)=20
len(unquoted_results)=20
FAILED

I suspect this is because both queries actually return more than 20 but are being capped at the default page size (since both 45 and 26 are above that first page length). When I change the test to use the search term "fireworks celebration" instead (which returns 7 and 3 results unquoted and quoted respectively) the tests pass fine!

api/test/image_integration_test.py

sarayourfriend · 2022-11-29T03:13:25Z

I took a look into why the tests are failing, through some debug printing it seems that both the quoted & unquoted searches are returning 20 results:

I hadn't had a chance to debug this yet today but this strikes me as an "oh gosh, of course"! Thank you for pointing it out. We could change the query or, more robustly, we could compare the results_count that is returned for each search. I will probably do that in case we add more results that would be returned by any of the test queries in the future.

AetherUnbound · 2022-11-29T03:18:02Z

Ooo, result count is a MUCH better thing to compare, good call!
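The difference between the two comparisons can be sketched with a stand-in for the API response. This is a simplified simulation, not the actual integration test: `fake_search` is hypothetical, the `result_count` key is an assumption about the response shape, and the totals (45 and 26) and the page size of 20 are the numbers reported earlier in this thread.

```python
PAGE_SIZE = 20


def fake_search(total_hits, page_size=PAGE_SIZE):
    """Stand-in for one page of an API response (hypothetical shape).

    The results list is capped at the page size; the total count is not.
    """
    return {
        "result_count": total_hits,
        "results": [{"id": i} for i in range(min(total_hits, page_size))],
    }


unquoted = fake_search(45)  # e.g. q=bird perched
quoted = fake_search(26)    # e.g. q="bird perched"

# Comparing page lengths cannot tell the two queries apart once both
# exceed the page size...
print(len(unquoted["results"]), len(quoted["results"]))  # 20 20

# ...whereas comparing the reported totals still can, and keeps working
# if more matching records are added to the test data later.
assert quoted["result_count"] < unquoted["result_count"]
```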

Title boost bug is fixed

stacimc

🎉

sarayourfriend added 2 commits November 21, 2022 15:34

Fix quoted audio search example escaping

0423500

Make quoted queries behave as described in API documentation

2eee346

sarayourfriend requested a review from a team as a code owner November 21, 2022 04:55

sarayourfriend requested review from krysal and stacimc November 21, 2022 04:55

openverse-bot added the labels 💻 aspect: code, 🛠 goal: fix, and 🟧 priority: high on Nov 21, 2022

stacimc previously requested changes Nov 24, 2022

Undo change breaking title match boosting

c6619b4

zackkrida requested a review from stacimc November 28, 2022 15:13

AetherUnbound approved these changes Nov 29, 2022

AetherUnbound reviewed Nov 29, 2022

api/test/image_integration_test.py

Fix and future proof tests against additional test data

595a18e

stacimc approved these changes Nov 29, 2022

sarayourfriend merged commit 4edae23 into main Nov 29, 2022

sarayourfriend deleted the add/exact-search-detection branch November 29, 2022 22:42

This was referenced Oct 30, 2023

Simplify search query WordPress/openverse#3261

Merged

Correctly set up mixing of exact search with stemming WordPress/openverse#3269

Closed
