Skip to content
This repository has been archived by the owner on Feb 22, 2023. It is now read-only.

Make quoted queries behave as described in the API documentation (return exact matches only) #1012

Merged
merged 4 commits into from
Nov 29, 2022

Conversation

sarayourfriend
Copy link
Contributor

Fixes

Fixes #933 by @AetherUnbound

Description

Conditionally applies the quote_field_suffix query feature as suggested in the Elasticsearch documentation that @AetherUnbound shared in the linked issue: https://www.elastic.co/guide/en/elasticsearch/reference/current/mixing-exact-search-with-stemming.html

We can continue to "boost" exact matches against the title as we originally were, but I've added the .exact modifier there as well to ensure that the boost only applies to actual exact matches.

The latter is hard to test locally, I haven't found any suitable queries for it in the test data and I do not know how to efficiently add usable test data.

The former, however, is pretty straightforward to test locally. Find a query that returns some results that you could narrow using an exact match. I found one for images and one for audio that are used in the integration tests:

Audio: http://localhost:50280/v1/audio/?q=water%20running vs http://localhost:50280/v1/audio/?q=%22water%20running%22

Images: http://localhost:50280/v1/images/?q=bird%20perched vs http://localhost:50280/v1/images/?q=%22bird%20perched%22

You can try these locally to see how the results differ and that the way they differ makes sense. The integration test only confirms that they do indeed differ. I'm not sure how to reasonably test that the exact match is "working" in a way that doesn't test implementation details. We could intercept the query to ES and see that quote_field_suffix is applied. Alternatively we could test that the exact match is present in the match fields of the response... but both feel like either testing Elasticsearch (which we do not have to do) or testing an abstract implementation detail that doesn't approximate the actual expectation. Any advice here or creative solutions would be appreciated.

Finally, I also fixed the incorrectly quoted example queries for images and audio exact match queries.

Testing Instructions

Check out the testing instructions in the previous section. If you can find any other queries that demonstrate this working with the test data, please share them here.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@sarayourfriend sarayourfriend requested a review from a team as a code owner November 21, 2022 04:55
@openverse-bot openverse-bot added 💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents labels Nov 21, 2022
@github-actions
Copy link

github-actions bot commented Nov 21, 2022

API Developer Docs Preview: Ready

https://wordpress.github.io/openverse-api/_preview/1012

Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

@openverse-bot
Copy link
Collaborator

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@krysal
@stacimc
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend1 days, this PR was updated 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)2.

@sarayourfriend, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

Copy link
Contributor

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the testing instructions and finding some good sample queries!

Tests went well! For example, when testing with the bird perched query, I received 45 results when not using quotes and 26 when quoted. The exact match when searching with quotes worked perfectly. I think the boost may be broken however -- see comment below.

As for the testing, I agree with everything you said. Although it'd be nice to test further, the best I can come up with is testing that the appropriate parameters were passed, which feels like testing Elasticsearch itself.

quotes_stripped = query.replace('"', "")
exact_match_boost = Q(
"simple_query_string",
fields=["title"],
query=f'"{quotes_stripped}"',
fields=["title.exact"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might actually be breaking the exact match boost, oddly. When searching without quotes (http://localhost:50280/v1/images/?q=bird%20perched), I noticed that some results that did not have exact title match were included in the first page of results. (For example, there's a Black Bird Photo and Bird Nature Photo mixed in.)

I tried removing the .exact added here and then the boost appeared to work again -- only photos titled Bird Perched Photo appeared in the first page of results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting! Thank you for looking into and testing this. I'll make the change (and do a little reading to understand why that would be the case 🤔)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is very weird! I tried removing the .exact totally on a whim to compare, definitely expected your syntax here to be the correct one 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Staci, I did a bit of reading in the ES documentation because I was confused why it didn't work the way we expected. I think it comes down to us using the bool query in a particular way that I have to admit, I do not fully understand. I tried to re-write the boost (just out of curiosity, not for this PR) to use the field-boosting described in the simple query string DSL documentation, but I could not get the results to budge at all. I am pretty curious to read more about Elasticsearch to understand better how these things are meant to work and how score boosting is best approached. It'd be nice, in any case, to document the current approach, why it works and why (if it indeed is) it is the best and correct approach for our use case.

Anyway, I removed the .exact on this one and things are back in working order. Thanks again for looking into this!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really interesting exploration, Sara, and thanks for sharing your findings so far!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into it! Super strange.

I re-tested and everything looks good to me except for the test failure. Not sure what's going on there, as. when I test locally it's definitely the case that we get more unquoted results than quoted ones.

@zackkrida zackkrida requested a review from stacimc November 28, 2022 15:13
Copy link
Contributor

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Your suggested queries do return different results as expected.

I took a look into why the tests are failing, through some debug printing it seems that both the quoted & unquoted searches are returning 20 results:

test/image_integration_test.py::test_search_quotes_exact len(quoted_results)=20
len(unquoted_results)=20
FAILED

I suspect this is because both queries actually return more than 20 but are being capped at the default page size (since both 45 and 26 are above that first page length). When I change the test to use the search term "fireworks celebration" instead (which returns 7 and 3 results unquoted and quoted respectively) the tests pass fine!

@sarayourfriend
Copy link
Contributor Author

sarayourfriend commented Nov 29, 2022

I took a look into why the tests are failing, through some debug printing it seems that both the quoted & unquoted searches are returning 20 results:

I hadn't had a chance to debug this yet today but this strikes me as an "oh gosh, of course"! Thank you for pointing it out. We could change the query or, more robustly, we could compare the results_count that is returned for each search. I will probably do that in case we add more results that would be returned by any of the test queries in the future.

@AetherUnbound
Copy link
Contributor

Ooo, result count is a MUCH better thing to compare, good call!

@sarayourfriend sarayourfriend dismissed stacimc’s stale review November 29, 2022 03:20

Title boost bug is fixed

Copy link
Contributor

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Exact match does not function as described in the documentation
5 participants