Skip to content
This repository has been archived by the owner on Feb 22, 2023. It is now read-only.

Make quoted queries behave as described in the API documentation (return exact matches only) #1012

Merged
merged 4 commits into from
Nov 29, 2022
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 13 additions & 6 deletions api/catalog/api/controllers/search_controller.py
Original file line number Diff line number Diff line change
Expand Up @@ -340,18 +340,25 @@ def search(
search_fields = ["tags.name", "title", "description"]
if "q" in search_params.data:
query = _quote_escape(search_params.data["q"])
base_query_kwargs = {
"query": query,
"fields": search_fields,
"default_operator": "AND",
}

if '"' in query:
base_query_kwargs["quote_field_suffix"] = ".exact"

s = s.query(
"simple_query_string",
query=query,
fields=search_fields,
default_operator="AND",
**base_query_kwargs,
)
# Boost exact matches
# Boost exact matches on the title
quotes_stripped = query.replace('"', "")
exact_match_boost = Q(
"simple_query_string",
fields=["title"],
query=f'"{quotes_stripped}"',
fields=["title.exact"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might actually be breaking the exact match boost, oddly. When searching without quotes (http://localhost:50280/v1/images/?q=bird%20perched), I noticed that some results that did not have exact title match were included in the first page of results. (For example, there's a Black Bird Photo and Bird Nature Photo mixed in.)

I tried removing the .exact added here and then the boost appeared to work again -- only photos titled Bird Perched Photo appeared in the first page of results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting! Thank you for looking into and testing this. I'll make the change (and do a little reading to understand why that would be the case 🤔)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is very weird! I tried removing the .exact totally on a whim to compare, definitely expected your syntax here to be the correct one 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Staci, I did a bit of reading in the ES documentation because I was confused why it didn't work the way we expected. I think it comes down to us using the bool query in a particular way that I have to admit, I do not fully understand. I tried to re-write the boost (just out of curiosity, not for this PR) to use the field-boosting described in the simple query string DSL documentation, but I could not get the results to budge at all. I am pretty curious to read more about Elasticsearch to understand better how these things are meant to work and how score boosting is best approached. It'd be nice, in any case, to document the current approach, why it works and why (if it indeed is) it is the best and correct approach for our use case.

Anyway, I removed the .exact on this one and things are back in working order. Thanks again for looking into this!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really interesting exploration, Sara, and thanks for sharing your findings so far!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into it! Super strange.

I re-tested and everything looks good to me except for the test failure. Not sure what's going on there, as. when I test locally it's definitely the case that we get more unquoted results than quoted ones.

query=f"{quotes_stripped}",
boost=10000,
)
s = search_client.query(Q("bool", must=s.query, should=exact_match_boost))
Expand Down
2 changes: 1 addition & 1 deletion api/catalog/api/examples/audio_requests.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@
syntax_examples = {
"using single query parameter": "test",
"using multiple query parameters": "test&license=pdm,by&categories=illustration&page_size=1&page=1", # noqa: E501
"that is an exact match of Giacomo Puccini": '"Giacomo Puccini"',
"that is an exact match of Giacomo Puccini": r"%22Giacomo%20Puccini%22",
"related to both dog and cat": "dog+cat",
"related to dog or cat, but not necessarily both": "dog|cat",
"related to dog but won't include results related to 'pug'": "dog -pug",
Expand Down
2 changes: 1 addition & 1 deletion api/catalog/api/examples/image_requests.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@
syntax_examples = {
"using single query parameter": "test",
"using multiple query parameters": "test&license=pdm,by&categories=illustration&page_size=1&page=1", # noqa: E501
"that are an exact match of Claude Monet": '"Claude Monet"',
"that are an exact match of Claude Monet": "%22Claude%20Monet%22",
"related to both dog and cat": "dog+cat",
"related to dog or cat, but not necessarily both": "dog|cat",
"related to dog but won't include results related to 'pug'": "dog -pug",
Expand Down
6 changes: 6 additions & 0 deletions api/test/audio_integration_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
search_by_category,
search_consistency,
search_quotes,
search_quotes_exact,
search_source_and_excluded,
search_special_chars,
stats,
Expand Down Expand Up @@ -101,6 +102,11 @@ def test_search_quotes():
search_quotes("audio", "love")


def test_search_quotes_exact():
# ``water running`` returns different results when quoted vs unquoted
search_quotes_exact("audio", "water running")


def test_search_with_special_characters():
search_special_chars("audio", "love")

Expand Down
6 changes: 6 additions & 0 deletions api/test/image_integration_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
search_all_excluded,
search_consistency,
search_quotes,
search_quotes_exact,
search_source_and_excluded,
search_special_chars,
stats,
Expand Down Expand Up @@ -53,6 +54,11 @@ def test_search_quotes():
search_quotes("images", "dog")


def test_search_quotes_exact():
# ``bird perched`` returns different results when quoted vs unquoted
search_quotes_exact("images", "bird perched")
sarayourfriend marked this conversation as resolved.
Show resolved Hide resolved


def test_search_with_special_characters():
search_special_chars("images", "dog")

Expand Down
19 changes: 19 additions & 0 deletions api/test/media_integration.py
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,25 @@ def search_quotes(media_path, q="test"):
assert response.status_code == 200


def search_quotes_exact(media_path, q):
"""Only returns exact matches for the given query"""
unquoted_response = requests.get(f"{API_URL}/v1/{media_path}?q={q}", verify=False)
assert unquoted_response.status_code == 200
unquoted_results = unquoted_response.json()["results"]
assert len(unquoted_results) > 0

quoted_response = requests.get(f'{API_URL}/v1/{media_path}?q="{q}"', verify=False)
assert quoted_response.status_code == 200
quoted_results = quoted_response.json()["results"]
assert len(quoted_results) > 0

# The rationale here is that the unquoted results will match more records due
# to the query being overall less strict. Quoting the query will make it more
# strict causing it to return fewer results.
# Above we check that the results are not 0 to confirm that we do still get results back.
assert len(quoted_results) < len(unquoted_results)


def search_special_chars(media_path, q="test"):
"""Returns a response when query includes special characters."""
response = requests.get(f"{API_URL}/v1/{media_path}?q={q}!", verify=False)
Expand Down